Abstract
Optimizing the performance of complex simulation codes with high
computational demands, such as Octo-Tiger, is an ongoing challenge. Octo-Tiger
is an astrophysics code simulating the evolution of star systems based on the
fast multipole method on adaptive octrees. It is implemented using the
high-level C++ libraries HPX and Vc, which allow it to run on different
hardware platforms. Recently, we have demonstrated excellent scalability in a
distributed setting. In this paper, we study Octo-Tiger’s node-level
performance on an Intel Knights Landing platform. We focus on the fast
multipole method, as it is Octo-Tiger’s most computationally demanding
component. By using HPX and a futurization approach, we can efficiently
traverse the adaptive octrees in parallel. At the core level, threads process
sub-grids using multiple 743-element stencils. In numerical experiments
simulating the time evolution of a rotating star on an Intel Xeon Phi 7250
Knights Landing processor, Octo-Tiger shows good parallel efficiency and
achieves up to 408 GFLOPS. This corresponds to a 2x speedup over a
24-core Skylake-SP platform using the same high-level abstractions.