Abstract:
Various optimizations of the ART software package for the solution of the radiative transfer equation in three dimensions are discussed in this white paper. All critical path functions of the code have been optimized and vectorized using OpenMP directives. Several techniques have been used, amongst others the rearrangement of input data and internal data structures to facilitate usage of CPU vector units, vectorization of calls to the math library, explicit loop unrolling to allow vectorization of iterative loops with a convergence criterion (...)
Various optimizations of the ART software package for the solution of the radiative transfer equation in three dimensions are discussed in this white paper. All critical path functions of the code have been optimized and vectorized using OpenMP directives. Several techniques have been used, amongst others the rearrangement of input data and internal data structures to facilitate usage of CPU vector units, vectorization of calls to the math library, explicit loop unrolling to allow vectorization of iterative loops with a convergence criterion, vectorization of data-dependent if-statements through enforced computations on all SIMD lanes and filtering of the final result. Several technical challenges had to be overcome to achieve the best performance. The OpenMPI stack needed to be compiled with a custom (non-native) Glibc library. In some cases, individual vectorized clones generated automatically by the compilers needed to be substituted with custom functions implemented manually using compiler intrinsics. Performance tests have shown that on the Broadwell architecture the optimized code works from 2.5x faster (RT solver) to 13x faster (EOS solver) on a single core. MPI implementation of the code scales with 95% efficiency on 2048 cores. Throughout the project, several GCC bugs related to automatic OpenMP vectorization have been reported, which shows that the support for the relatively new OpenMP vectorization features is still not mature. For some of those bugs effective workarounds have been developed. We also point to some shortcomings in the OpenMP simd vectorization framework and develop several new optimization techniques, which improve effectiveness of automatic code vectorization. Finally, some generally useful tutorials have been delivered.
(Read More)