Performance Analysis and GPU Scalability of OGSTM-BFM / Feitosa Benevides, Andre'. - (2026 Mar 27).

Performance Analysis and GPU Scalability of OGSTM-BFM

FEITOSA BENEVIDES, ANDRE'
2026-03-27

Abstract

OGSTM-BFM is a coupled physical-biogeochemical model developed at the National Institute of Oceanography and Applied Geophysics (OGS) [1, 2, 3, 4] and is used for climate-related studies. Recent work within ESiWACE3 (Centre of Excellence in Simulation of Weather and Climate in Europe) reported substantial performance gains after porting the model to GPU architectures [5]. This thesis presents a reproducibility study of selected GPU performance results of OGSTM-BFM on the Leonardo supercomputer, together with further investigations that were not included in previous ESiWACE3 reports. GPUs deliver a significant acceleration, provided that appropriate MPI rank-to-GPU mappings and Multi-Process Service (MPS) configurations are employed. Several configurations were tested. The GPU version running on two nodes, each with four NVIDIA A100 GPUs and four MPI ranks per GPU, achieves a speedup of 1.64 over two dual-socket nodes with Intel Sapphire Rapids CPUs. Enabling MPS raises the speedup to 5.72. Different mappings of the ranks to the GPUs were then tested: round-robin mapping combined with 50% MPS (each rank limited to about 50% of the GPU threads) further increases the speedup to 5.97. Lastly, adding NUMA binding in the MPS launch path yields a speedup of 6.22. The ESiWACE3 Technical Report [5] states a speedup of 7.41; this study could not reproduce that result.

In addition, this work evaluates an alternative implementation of the vertical diffusion tridiagonal solver using NVIDIA's cuSPARSE batched routines. The original implementation solves the tridiagonal system with a method that does not allow for full parallelism but is highly specialized to tridiagonal matrices. Benchmark tests were developed both in isolation and integrated into OGSTM-BFM. In isolation, the cuSPARSE-based variant is about 3.33 times slower than the specialized version. Integrated into OGSTM-BFM, it leads to runs 8% slower than the baseline in the tested configuration. These results emphasize that increased parallelism alone does not guarantee improved time-to-solution; they also show that launch-level tuning (MPS, mapping, NUMA placement) is as important as kernel-level optimization. The thesis provides practical insights into GPU reproducibility, solver integration, and performance engineering for large-scale scientific applications.
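
To make the rank-to-GPU mapping discussed in the abstract concrete, the following is a minimal Python sketch, not the thesis's launch script. It assumes four GPUs per node and four MPI ranks per GPU, as stated in the abstract. The function names and the "block" mapping used for comparison are hypothetical; the abstract does not say which alternative mappings were tested.

```python
# Illustrative sketch only: two ways to assign node-local MPI ranks to GPUs.
# GPUS_PER_NODE and RANKS_PER_GPU follow the abstract; everything else
# (function names, the "block" alternative) is an assumption for illustration.

GPUS_PER_NODE = 4
RANKS_PER_GPU = 4


def round_robin_gpu(local_rank: int, gpus_per_node: int) -> int:
    """Consecutive local ranks cycle through the GPUs: 0, 1, 2, 3, 0, 1, ..."""
    return local_rank % gpus_per_node


def block_gpu(local_rank: int, ranks_per_gpu: int) -> int:
    """Consecutive local ranks fill one GPU before moving on: 0, 0, 0, 0, 1, ..."""
    return local_rank // ranks_per_gpu


local_ranks = range(GPUS_PER_NODE * RANKS_PER_GPU)
round_robin = [round_robin_gpu(r, GPUS_PER_NODE) for r in local_ranks]
block = [block_gpu(r, RANKS_PER_GPU) for r in local_ranks]

print("round-robin:", round_robin)
print("block:      ", block)
```

Both mappings place four ranks on each GPU; they differ only in which ranks share a device. In an actual launch, the selected index would typically be exposed to each rank through `CUDA_VISIBLE_DEVICES`, and a per-client thread limit under MPS is commonly set with `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE`. Whether the thesis uses these exact mechanisms is not stated in the abstract.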
Campanella, Fabio
Bolzon, Giorgio; Girotto, Ivan
Files in this item:
thesis_Feitosa_Benevides.pdf

open access

Type: Thesis
License: Not specified
Size: 1.37 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11767/151812