GPU Analysis & Results

Benchmarks conducted by Inductiva with technical support from Pedro Costa (TU Delft)

Special thanks to Dr. Baptiste Hardy (TU Delft) for his support in devising this temporal boundary layer setup


This benchmark report compares performance across several GPU configurations, helping you select the right simulation hardware for your CaNS projects.

We benchmark a temporal boundary layer with stable stratification case, following the same scenario detailed in our tutorial.

Results

The benchmarks cover a range of cloud machines with different GPUs. The reference setup is the most affordable and smallest configuration, featuring 4 virtual CPUs (vCPUs) paired with a single NVIDIA L4 GPU. Other configurations that were tested include more powerful machines with increased CPU counts and higher-performance GPUs, such as the NVIDIA A100 and H100, which allow us to evaluate how scaling hardware resources affects simulation speed.

Below is a detailed comparison of execution times and speed-ups across different machine types:

| Machine Type | vCPUs | GPU Type | GPU Count | Execution Time | Speed-up | Estimated Cost (USD) |
|---|---|---|---|---|---|---|
| g2-standard-4 | 4 | NVIDIA L4 | 1 | 25h, 3 min | Reference | 6.86 |
| g2-standard-24 | 24 | NVIDIA L4 | 2 | 15h, 55 min | 1.57x | 10.75 |
| a2-highgpu-1 | 12 | NVIDIA A100 | 1 | 4h, 44 min | 5.29x | 7.38 |
| a2-highgpu-2 | 24 | NVIDIA A100 | 2 | 2h, 47 min | 9.00x | 8.85 |
| a3-highgpu-1 | 26 | NVIDIA H100 | 1 | 2h, 26 min | 10.29x | 6.52 |
| a3-highgpu-2 | 52 | NVIDIA H100 | 2 | 1h, 36 min | 15.65x | 8.64 |

Table 1: Benchmark results on Inductiva
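The speed-up column in Table 1 is simply the reference wall-clock time divided by each machine's wall-clock time. A minimal sketch of that calculation, with the times transcribed from Table 1:

```python
# Execution times from Table 1, in minutes.
times_min = {
    "g2-standard-4 (1x L4)":   25 * 60 + 3,   # reference configuration
    "g2-standard-24 (2x L4)":  15 * 60 + 55,
    "a2-highgpu-1 (1x A100)":   4 * 60 + 44,
    "a2-highgpu-2 (2x A100)":   2 * 60 + 47,
    "a3-highgpu-1 (1x H100)":   2 * 60 + 26,
    "a3-highgpu-2 (2x H100)":   1 * 60 + 36,
}

reference = times_min["g2-standard-4 (1x L4)"]

# Speed-up = reference wall-clock time / this machine's wall-clock time.
speedups = {name: reference / t for name, t in times_min.items()}

for name, s in speedups.items():
    print(f"{name}: {s:.2f}x")
```

Running this reproduces the speed-up column of Table 1 to within rounding (e.g. 9.00x for the dual-A100 machine).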

To further assess performance, we calculated the scaled time per cell, defined as the wall-clock time multiplied by the number of GPUs and divided by the total number of cells. Each time step in CaNS consists of three RK3 substeps, each of which solves a large Poisson equation. In this simulation mode of CaNS (with `is_impdiff_1d = T` set in the configuration file), the expected scaled time per cell on A100 GPUs is on the order of nanoseconds. These estimates are summarized below:

| Machine Type | GPU Type | GPU Count | Execution Time | Scaled Time per Cell (s) |
|---|---|---|---|---|
| g2-standard-4 | NVIDIA L4 | 1 | 25h, 3 min | 1.769e-08 |
| g2-standard-24 | NVIDIA L4 | 2 | 15h, 55 min | 2.247e-08 |
| a2-highgpu-1 | NVIDIA A100 | 1 | 4h, 44 min | 3.343e-09 |
| a2-highgpu-2 | NVIDIA A100 | 2 | 2h, 47 min | 3.929e-09 |
| a3-highgpu-1 | NVIDIA H100 | 1 | 2h, 26 min | 1.719e-09 |
| a3-highgpu-2 | NVIDIA H100 | 2 | 1h, 36 min | 2.259e-09 |

Table 2: Scaled time per cell estimates calculated by Pedro Costa
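As a sanity check on Table 2, the definition above can be inverted: wall-clock time times GPU count divided by the scaled time per cell should recover the same denominator for every row. A short sketch of that check, with all values transcribed from Table 2:

```python
# (machine, wall-clock time in seconds, GPU count, scaled time per cell in s),
# transcribed from Table 2.
rows = [
    ("g2-standard-4",  (25 * 60 + 3) * 60,  1, 1.769e-08),
    ("g2-standard-24", (15 * 60 + 55) * 60, 2, 2.247e-08),
    ("a2-highgpu-1",   (4 * 60 + 44) * 60,  1, 3.343e-09),
    ("a2-highgpu-2",   (2 * 60 + 47) * 60,  2, 3.929e-09),
    ("a3-highgpu-1",   (2 * 60 + 26) * 60,  1, 1.719e-09),
    ("a3-highgpu-2",   (1 * 60 + 36) * 60,  2, 2.259e-09),
]

# Invert the definition: scaled = wall * gpus / denominator
#                     => denominator = wall * gpus / scaled
denominators = [wall * gpus / scaled for _, wall, gpus, scaled in rows]

# If Table 2 is internally consistent, every row yields (nearly) the same value.
spread = max(denominators) / min(denominators)
print(f"denominator ~ {denominators[0]:.3e}, max/min spread: {spread:.4f}")
```

All six rows recover the same denominator to within about 0.1%, which is the consistency the Summary below refers to.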

Summary

The benchmark results clearly demonstrate the substantial performance benefits of using higher-end GPUs for CaNS simulations. The best-performing setup, equipped with two NVIDIA H100 GPUs, achieved a 15.7× speed-up over the baseline machine with a single NVIDIA L4 GPU, reducing execution time from 25 hours to just 1 hour and 36 minutes.

These gains, however, are subject to the limits of strong scaling: as GPU count increases while the problem size remains fixed, each GPU handles a smaller workload, leading to reduced occupancy and less-than-linear scaling. For example, doubling the H100 count cuts the execution time from 2h 26 min to 1h 36 min, a 1.52x gain rather than 2x, or roughly 76% parallel efficiency.

Comparing the measured execution times (Table 1) with the scaled time per cell per GPU (Table 2), we observe strong agreement across all machines. The anticipated performance on the order of a few nanoseconds per cell per GPU was achieved on the A100 and H100 machines, confirming the robustness and reliability of both the Inductiva platform and the CaNS solver.

With Inductiva, you’re able to seamlessly select the hardware that delivers the performance your simulations demand.