High performance computing is required to make feasible simulations of whole organ models of the heart with biophysically detailed cellular models in a clinical setting. Increasing model detail by simulating electrophysiology and mechanical models increases computation demands. We present scaling results of an electro mechanical cardiac model of two ventricles and compare them to our previously published results using an electrophysiological model only. The anatomical data-set was given by both ventricles of the Visible Female data-set in a 0.2 mm resolution. Fiber orientation was included. Data decomposition for the distribution onto the distributed memory system was carried out by orthogonal recursive bisection. Load weight ratios for non tissue vs. tissue elements used in the data decomposition were 1:1, 1:2, 1:5, 1:10, 1:25, 1:38.85, 1:50 and 1:100. The ten Tusscher et al. (2004) electrophysiological cell model was used and the Rice et al. (1999) model for the computation of the calcium transient dependent force. Scaling results for 512, 1024, 2048, 4096, 8192 and 16,384 processors were obtained for 1 ms simulation time. The simulations were carried out on an IBM Blue Gene/L supercomputer. The results show linear scaling from 512 to 16,384 processors with speedup factors between 1.82 and 2.14 between partitions. The most optimal load ratio was 1:25 for on all partitions. However, a shift towards load ratios with higher weight for the tissue elements can be recognized as can be expected when adding computational complexity to the model while keeping the same communication setup. This work demonstrates that it is potentially possible to run simulations of 0.5 s using the presented electro-mechanical cardiac model within 1.5 hours.
Multi-scale, multi-physical heart models have not yet been able to include a high degree of accuracy and resolution with respect to model detail and spatial resolution due to computational limitations of current systems. We propose a framework to compute large scale cardiac models. Decomposition of anatomical data in segments to be distributed on a parallel computer is carried out by optimal recursive bisection (ORB). The algorithm takes into account a computational load parameter which has to be adjusted according to the cell models used. The diffusion term is realized by the monodomain equations. The anatomical data-set was given by both ventricles of the Visible Female data-set in a 0.2 mm resolution. Heterogeneous anisotropy was included in the computation. Model weights as input for the decomposition and load balancing were set to (a) 1 for tissue and 0 for non-tissue elements; (b) 10 for tissue and 1 for non-tissue elements. Scaling results for 512, 1024, 2048, 4096 and 8192 computational nodes were obtained for 10 ms simulation time. The simulations were carried out on an IBM Blue Gene/L parallel computer. A 1 s simulation was then carried out on 2048 nodes for the optimal model load. Load balances did not differ significantly across computational nodes even if the number of data elements distributed to each node differed greatly. Since the ORB algorithm did not take into account computational load due to communication cycles, the speedup is close to optimal for the computation time but not optimal overall due to the communication overhead. However, the simulation times were reduced form 87 minutes on 512 to 11 minutes on 8192 nodes. This work demonstrates that it is possible to run simulations of the presented detailed cardiac model within hours for the simulation of a heart beat.
Increasing biophysical detail in multi physical, multiscale cardiac model will demand higher levels of parallelism in multi-core approaches to obtain fast simulation times. As an example of such a highly parallel multi-core approaches, we develop a completely distributed bidomain cardiac model implemented on the IBM Blue Gene/L architecture. A tissue block of size 50 times 50 times 100 cubic elements based on ten Tusscher et al. (2004) cell model is distributed on 512 computational nodes. The extracellular potential is calculated by the Gauss-Seidel (GS) iterative method that typically requires high levels of inter-processor communication. Specifically, the GS method requires knowledge of all cellular potentials at each of it iterative step. In the absence of shared memory, the values are communicated with substantial overhead. We attempted to reduce communication overhead by computing the extracellular potential only every 5th time step for the integration of the cell models. We also investigated the effects of reducing inter-processor communication to every 5th, 10th, 50th iteration or no communication within the GS iteration. While technically incorrect, these approximation had little impact on numerical convergence or accuracy for the simulations tested. The results suggest some heuristic approaches may further reduce the inter-processor communication to improve the execution time of large-scale simulations.
Orthogonal recursive bisection (ORB) algorithm can be used as data decomposition strategy to distribute a large data set of a cardiac model to a distributed memory supercomputer. It has been shown previously that good scaling results can be achieved using the ORB algorithm for data decomposition. However, the ORB algorithm depends on the distribution of computational load of each element in the data set. In this work we investigated the dependence of data decomposition and load balancing on different rotations of the anatomical data set to achieve optimization in load balancing. The anatomical data set was given by both ventricles of the Visible Female data set in a 0.2 mm resolution. Fiber orientation was included. The data set was rotated by 90 degrees around x, y and z axis, respectively. By either translating or by simply taking the magnitude of the resulting negative coordinates we were able to create 14 data sets of the same anatomy with different orientation and position in the overall volume. Computation load ratios for non tissue vs. tissue elements used in the data decomposition were 1:1, 1:2, 1:5, 1:10, 1:25, 1:38.85, 1:50 and 1:100 to investigate the effect of different load ratios on the data decomposition. The ten Tusscher et al. (2004) electrophysiological cell model was used in monodomain simulations of 1 ms simulation time to compare performance using the different data sets and orientations. The simulations were carried out for load ratio 1:10, 1:25 and 1:38.85 on a 512 processor partition of the IBM Blue Gene/L supercomputer. The results show that the data decomposition does depend on the orientation and position of the anatomy in the global volume. The difference in total run time between the data sets is 10 s for a simulation time of 1 ms. This yields a difference of about 28 h for a simulation of 10 s simulation time. However, given larger processor partitions, the difference in run time decreases and becomes less significant. Depending on the processor partition size, future work will have to consider the orientation of the anatomy in the global volume for longer simulation runs.