GPU Acceleration of CFD
First, we optimized the overall HPC application by removing several bottlenecks, redesigning it to make the most of the given hardware architecture and optimizing things like memory usage, calculations and communication channels. Then we used artificial intelligence to tweak it even further and ensure that it reaches the highest possible performance that the underlying CPU + GPU architecture could provide.
12x performance boost, reduced energy consumption by 30%.
Besides that, our solution extensively overlaps GPU computations with data transfers, both between computational nodes and between the GPU accelerator and the CPU host within a node. In this particular case, we decomposed the computational domain into two unequal parts: a data-dependent part and a data-independent part. That allowed data transfers to run simultaneously with the computations for the second (data-independent) part. In that configuration, we reached 16.372 Tflop/s using 136 GPUs.
Also, our approach to adapting the 3D MPDATA to GPU architectures allowed us to achieve up to 482.5 Gflop/s on a platform equipped with two NVIDIA Tesla K80 GPUs, 514 Gflop/s using a single NVIDIA Tesla P100 GPU and up to 16.1 Tflop/s on the Piz Daint cluster using 136 NVIDIA Tesla K20X GPUs. It is currently the highest performance result for the 3D MPDATA obtained on a single node that can be found in the scientific literature.
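To make the overlap idea concrete, here is a minimal Python sketch. It is illustrative only and is not the byteLAKE implementation: the 3-point `stencil`, the `fetch_halo` helper and the simulated transfer delay are all assumptions, and a background thread stands in for an asynchronous GPU or network transfer. The interior cells (data independent) are updated while the halo values are still in flight; the boundary cells (data dependent) are updated once the halo arrives.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def stencil(left, center, right):
    # Toy 3-point average standing in for the real MPDATA stencil (assumption).
    return (left + center + right) / 3.0

def fetch_halo(neighbor_left, neighbor_right):
    # Hypothetical stand-in for a node-to-node or host-to-GPU transfer;
    # the sleep only mimics transfer latency.
    time.sleep(0.01)
    return neighbor_left[-1], neighbor_right[0]

def step_overlapped(local, neighbor_left, neighbor_right):
    # Requires at least two local cells.
    n = len(local)
    out = [0.0] * n
    with ThreadPoolExecutor(max_workers=1) as pool:
        # 1. Start the halo transfer in the background.
        halo = pool.submit(fetch_halo, neighbor_left, neighbor_right)
        # 2. Meanwhile, update the interior cells, which need no remote data.
        for i in range(1, n - 1):
            out[i] = stencil(local[i - 1], local[i], local[i + 1])
        # 3. Wait for the halo, then update the two boundary cells.
        halo_left, halo_right = halo.result()
    out[0] = stencil(halo_left, local[0], local[1])
    out[-1] = stencil(local[-2], local[-1], halo_right)
    return out

def step_reference(local, neighbor_left, neighbor_right):
    # Same update without any overlap, computed on a padded copy.
    padded = [neighbor_left[-1]] + local + [neighbor_right[0]]
    return [stencil(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]
```

Both functions produce the same values; the overlapped version simply hides the transfer time behind the interior update. On a real GPU the same pattern would typically be expressed with asynchronous copies and separate streams rather than Python threads.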
Benefits for the customer
- HPC application making the most of all new hardware features (= no wasted investment)
- Optimized software (better performance + lower energy consumption = savings in terms of money and time + frustration-free environment)
- Access to innovations (byteLAKE team strives to bring the very best of the academic research developments into business)
- Beyond standard optimizations, the solution designed by byteLAKE ensures that the algorithm parameters are tweaked online
During the project, we created a software automatic tuning (autotuning for short) toolbox. It enables software to adapt automatically to a variety of computational conditions. The concept originates from HPC research and is considered one of the most promising approaches to achieving performance gains on next-generation supercomputing platforms.
Short background: autotuning has been used extensively on CPUs to automatically generate near-optimal numerical libraries. For example, ATLAS (Automatically Tuned Linear Algebra Software) and PHiPAC (Portable High Performance ANSI C) use autotuning to generate highly optimized versions of BLAS (Basic Linear Algebra Subprograms). Also, efforts related to the autotuning of CUDA kernels for NVIDIA GPUs have shown that this technique is a very practical approach, e.g. when porting existing algorithmic solutions to quickly evolving GPU architectures. What is more, autotuning can substantially speed up even highly tuned hand-written kernels.
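The core loop of an empirical autotuner can be sketched in a few lines of Python. This is a generic illustration under stated assumptions, not the toolbox described here: `blocked_sum` is a toy kernel, the block size is the only tuning parameter, and `autotune` is a hypothetical helper that simply times every candidate and keeps the fastest.

```python
import time

def blocked_sum(data, block):
    # Toy kernel (assumption): sum the list in chunks of `block` elements.
    total = 0.0
    for start in range(0, len(data), block):
        total += sum(data[start:start + block])
    return total

def autotune(kernel, data, candidates, repeats=3):
    # Time each candidate parameter and return the fastest one
    # together with the best observed time per candidate.
    timings = {}
    for block in candidates:
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            kernel(data, block)
            best = min(best, time.perf_counter() - t0)
        timings[block] = best
    return min(timings, key=timings.get), timings

if __name__ == "__main__":
    data = [1.0] * 100_000
    best_block, timings = autotune(blocked_sum, data, [16, 256, 4096])
    print("fastest block size on this machine:", best_block)
```

The winning parameter depends on the machine the search runs on, which is exactly why the search is repeated per platform instead of hard-coding a value.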
Our solution focuses on the performance and energy consumption optimizations. It utilizes techniques like mixed precision arithmetic and manages them dynamically using machine learning (a modified version of the random forest algorithm in our case).
To optimize the performance, we combined the following methods:
- reducing the number of operations through subexpression elimination when implementing 2.5D blocking;
- reorganizing the boundary conditions to reduce branch instructions;
- advanced memory management to increase coalesced memory access;
- rearranging warps to optimize data access to GPU global memory.
That let us use many graphics processors within a single node efficiently by applying peer-to-peer data transfers between GPU global memories. In addition, we took into account both algorithm-specific and GPU-specific parameters when optimizing the overall software.
- power consumption measurement module
It estimates power consumption as a function of the processor frequency and the number of cores. Our method uses only a very small set of real power measurements on the CPU-based platform. We tested the estimates it produced using two real scientific applications, 3D MPDATA and the conjugate gradient (CG) method for sparse linear systems, on a variety of ARM and Intel architectures. The average error stayed slightly below 5% in all cases when compared with the real measurements. Also, our method produced accurate estimates for any number of cores and voltage–frequency configurations based on a very small number of samples (e.g. 5 real measurements out of 60+ possible). It is important to note that we do not rely on hardware counters or fine-grain measurements. Instead, we rely only on executing a few steps of the target iterative application and on readings from coarse-grain watt meters. In contrast to other approaches, we do not require samples for the full range of frequency configurations and/or thread counts; measurements for a few selected cases are enough to derive the behavior of the application's power consumption. In other words, our method starts from simple power models and converges rapidly to a stable solution in the form of a table containing the sought-after estimates.
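The idea of filling a whole frequency/core table from a handful of readings can be illustrated with a small Python sketch. Everything specific in it is an assumption made for illustration: the model form P = p0 + k * cores * freq^3 is a common textbook approximation rather than the model used in the project, the one-shot least-squares fit replaces the iterative refinement described above, and `fit_power_model` and `power_table` are hypothetical names.

```python
def fit_power_model(samples):
    # samples: list of (freq_ghz, cores, watts) readings.
    # Fits P = p0 + k * cores * freq**3 by ordinary least squares
    # (assumed model form, chosen only for illustration).
    xs = [n * f ** 3 for f, n, _ in samples]
    ys = [p for _, _, p in samples]
    m = len(samples)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = sxy / sxx
    p0 = mean_y - k * mean_x
    return p0, k

def power_table(p0, k, freqs, core_counts):
    # Estimated power for every (frequency, cores) configuration.
    return {(f, n): p0 + k * n * f ** 3 for f in freqs for n in core_counts}

if __name__ == "__main__":
    # Five made-up readings (not measured data) out of 60 configurations.
    samples = [(1.0, 1, 21.5), (2.0, 1, 32.0), (1.0, 4, 26.0),
               (2.0, 4, 68.0), (1.5, 2, 30.1)]
    p0, k = fit_power_model(samples)
    table = power_table(p0, k, [1.0 + 0.1 * i for i in range(10)], range(1, 7))
    print(len(table), "configurations estimated from", len(samples), "samples")
```

A real deployment would compare the table against a few held-out watt-meter readings before trusting it, which is how an error figure such as the one quoted above can be checked.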
- energy-aware task management module
It is designed specifically for forward-in-time algorithms running on multicore central processing units (CPUs). Our mechanism is based on dynamic voltage and frequency scaling (DVFS). It reduces the energy consumption of an existing algorithm (or application) while keeping a predefined execution time. Moreover, it does not require any modifications to the algorithm itself. It also applies the principles of adaptive scheduling with online modeling to minimize the energy consumption within the given time constraints. Finally, we automated the creation and selection of the best energy profile at runtime, even in the presence of additional CPU workloads. Experimental results on a 6-core computing platform showed that our mechanism provides energy savings of up to 1.43x compared to the default Linux scaling governor. We also confirmed the effectiveness of its self-adaptive feature by showing that it maintains the requested execution time despite additional CPU workloads imposed by other applications.
- mixed precision arithmetic applied dynamically by machine learning
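The selection step behind such a mechanism can be sketched as follows. This is a simplified, static illustration and not the adaptive scheduler itself: `pick_frequency` is a hypothetical helper, and the per-frequency time and power figures are assumed to come from some model or earlier measurement. It picks the frequency with the lowest predicted energy (time multiplied by power) among those that still meet the deadline.

```python
def pick_frequency(profiles, deadline_s):
    # profiles: {freq_ghz: (predicted_time_s, predicted_power_w)}
    # Keep only frequencies that meet the deadline and store their energy.
    feasible = {f: t * p for f, (t, p) in profiles.items() if t <= deadline_s}
    if not feasible:
        # No setting meets the deadline: fall back to the highest frequency.
        return max(profiles)
    # Lowest predicted energy among the feasible settings.
    return min(feasible, key=feasible.get)

if __name__ == "__main__":
    # Made-up numbers for illustration only.
    profiles = {1.2: (10.0, 30.0), 1.8: (7.0, 45.0), 2.4: (5.5, 70.0)}
    print(pick_frequency(profiles, deadline_s=8.0))
```

An adaptive version would re-run this choice periodically with updated time predictions, so that extra load from other applications pushes the selection toward a higher frequency when the deadline is at risk.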
We have tested the effectiveness of our solution and validated it with a real-life scientific application called MPDATA.
Short background: MPDATA is a part of the numerical model used in weather forecast simulations.
We deployed the results on the Piz Daint supercomputer (ranked 3rd on the TOP500 list as of Nov. 2017). Overall, performance increased 12 times compared to the original Fortran version. The software autotuning solution contributed a speedup of 1.32 and reduced energy consumption by about 34%.
Besides that, we have successfully validated the results on NVIDIA Kepler-based GPUs including Tesla K20X, GeForce GTX TITAN, a single Tesla K80 GPU, and a multi-GPU system with two K80 cards, as well as a GeForce GTX 980 GPU based on the NVIDIA Maxwell architecture.
Comparing a two-socket CPU platform with 2x Intel Xeon E5-2695 v2 2.4 GHz (2x12 cores) against an NVIDIA GeForce GTX TITAN (Kepler), the GPU delivered a speedup of 1.29. For the P100 (Pascal), the speedup reached 2.5.