GPU Porting and Optimization of the RegCM5

We implemented and tested various optimizations and GPU offloading strategies, based on OpenACC and on Fortran standard-language parallelism through do concurrent, for the coupled configuration of the RegCM5 regional climate model. For an idealized, uncoupled, no-IO configuration, in which RegCM5 simulates a free-running atmosphere, initial benchmarks showed that a GPU-enabled run on 2 Leonardo Booster nodes using 8 A100 GPUs and 8 CPU cores outperformed the baseline CPU-only configuration running on 8 nodes of the Leonardo DCGP partition with 896 CPU cores. However, when RegCM5 was coupled with the Community Land Model CLM4.5 for a scientifically realistic production run, the number of GPUs required to outperform the CPU-only configuration running on 800 CPU cores increased to 64. Initial profiling confirmed that this delayed crossover point, at which GPU-enabled runs begin to outperform the CPU-only baseline in the coupled configuration, is caused by several code paths that are activated only in this mode. The first performance hotspot accounted for 20% of the initial total wall-clock time and originated from the noncontiguous row-slice initialization of two-dimensional arrays in the hydrology tracer routine inside the module mod_clm_hydrology2. These whole-array initializations are lowered by the nvfortran compiler into a vectorized memset operation that appears under the symbol __c_memset_avx. To eliminate this hotspot, we replaced the implicit array operation with an explicit loop. This optimization yielded a 1.77× speedup for the GPU-enabled run. The next hotspot, which accounted for 54% of the resulting wall-clock time, was associated with the thermodynamic solver used to compute the Convective Available Potential Energy (CAPE) and Convective Inhibition (CIN).
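The array-initialization rewrite described above for the first hotspot can be illustrated with a minimal sketch. It is not taken from mod_clm_hydrology2: the array names, sizes, and index bounds (flux_slice, flux_loop, begc, endc) are hypothetical, and the original code is assumed to use row-slice assignments of the form shown. The sketch only demonstrates that an explicit do concurrent loop nest produces the same result as the implicit array operation.

```fortran
program init_sketch
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  integer, parameter :: ncol = 8, ntrac = 4      ! hypothetical array sizes
  integer, parameter :: begc = 3, endc = 6       ! hypothetical column bounds
  real(r8) :: flux_slice(ncol, ntrac), flux_loop(ncol, ntrac)
  integer :: c, t

  flux_slice = 1.0_r8
  flux_loop  = 1.0_r8

  ! Assumed original style: implicit array operation on a row slice.
  ! Fortran is column-major, so flux_slice(c, :) is noncontiguous in memory.
  do c = begc, endc
    flux_slice(c, :) = 0.0_r8
  end do

  ! Rewritten style: explicit loop nest with independent iterations,
  ! written with do concurrent so a compiler may parallelize or offload it.
  do concurrent (t = 1:ntrac, c = begc:endc)
    flux_loop(c, t) = 0.0_r8
  end do

  ! Both variants must zero exactly the same elements.
  if (any(flux_slice /= flux_loop)) error stop 'variants differ'
  print '(a,i0)', 'zeroed elements: ', count(flux_loop == 0.0_r8)
end program init_sketch
```

Built with any Fortran 2008 compiler, the program prints "zeroed elements: 16". With nvfortran, the do concurrent nest can be targeted at the GPU through the -stdpar=gpu option; whether this matches the build settings used for RegCM5 is not stated here.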
The CAPE/CIN post-processing diagnostic was ported to GPU by refactoring the original column-based routine into an accelerator-compatible version using explicit work arrays and by restructuring the output code so that each atmospheric column can be processed independently. This preserves the original physical formulation while exposing fine-grained parallelism suitable for GPU execution. We applied the same porting strategy to the call sites of two additional procedures, interp1d_r8 and heatindex, which together contributed another 20% of the wall-clock time. These optimizations and porting strategies produced an overall 7.10× speedup relative to the initial code. As a result, the GPU-enabled run using 16 GPUs on Booster now runs 2.4× faster than the CPU-only configuration using 800 CPU cores on DCGP. Additional optimizations and porting efforts, not discussed in the present work, further increased the total speedup to 14.17× relative to the initial version. We also studied the performance portability of the coupled configuration on two other platforms equipped with H100 GPUs. The coupled configuration exhibited the same superlinear scaling across all three computing platforms, the origin of which remains to be investigated in future work. Ensuring that the GPU-ported code reproduces the CPU-only results at the bitwise level is not yet part of the present work. At this stage, we observed only statistical reproducibility between the ported and CPU-only runs. After 7 model days, the numerical divergence remained below 0.1% for the near-surface air temperature and surface pressure, while that for the near-surface specific humidity remained within 5%.
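The column-independent restructuring described above for the CAPE/CIN diagnostic can be sketched as follows. This is an illustration under stated assumptions, not the RegCM5 solver: the buoyancy integral is deliberately simplified (plain temperature instead of virtual temperature, synthetic input profiles), and all names (t_env, t_par, dz, buoy, cape, cin) are hypothetical. The point is the structure: an explicit work array with a column dimension and an outer loop over columns whose iterations do not depend on one another.

```fortran
program column_sketch
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  integer, parameter :: ncol = 4, nlev = 5       ! hypothetical grid sizes
  real(r8), parameter :: grav = 9.80665_r8       ! standard gravity, m s-2
  real(r8) :: t_env(nlev, ncol), t_par(nlev, ncol), dz(nlev, ncol)
  real(r8) :: buoy(nlev, ncol)    ! explicit work array, one slice per column
  real(r8) :: cape(ncol), cin(ncol)
  integer :: i, k

  ! Synthetic profiles: the parcel is colder than the environment below
  ! level 3 and warmer above it, so both CIN and CAPE are nonzero.
  do i = 1, ncol
    do k = 1, nlev
      t_env(k, i) = 300.0_r8 - 6.0_r8 * real(k - 1, r8)
      t_par(k, i) = t_env(k, i) + real(k - 3, r8)
      dz(k, i)    = 1000.0_r8
    end do
  end do

  ! Each column is processed independently; the vertical loop stays sequential.
  ! The directives are comments to compilers without OpenACC support.
  !$acc parallel loop copyin(t_env, t_par, dz) create(buoy) copyout(cape, cin)
  do i = 1, ncol
    cape(i) = 0.0_r8
    cin(i)  = 0.0_r8
    !$acc loop seq
    do k = 1, nlev
      buoy(k, i) = grav * (t_par(k, i) - t_env(k, i)) / t_env(k, i) * dz(k, i)
      cape(i) = cape(i) + max(buoy(k, i), 0.0_r8)   ! positive area only
      cin(i)  = cin(i)  + min(buoy(k, i), 0.0_r8)   ! negative area only
    end do
  end do

  ! Sanity checks on the synthetic case: signs, and identical columns.
  if (any(cape <= 0.0_r8) .or. any(cin >= 0.0_r8)) error stop 'unexpected sign'
  if (any(cape /= cape(1))) error stop 'columns differ'
  print '(a,f10.3,a,f10.3)', 'cape = ', cape(1), '  cin = ', cin(1)
end program column_sketch
```

Because no column reads or writes data belonging to another column, the outer loop can be mapped to GPU threads without synchronization, which is the property the refactoring is meant to expose.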

GPU Porting and Optimization of the RegCM5 (2026 Mar 27).