GPU Analysis & Results

Benchmarks conducted by Inductiva with technical support from Pedro Costa (TU Delft)

Special thanks to Dr. Baptiste Hardy (TU Delft) for his support in devising this temporal boundary layer setup


This benchmark report compares performance across several GPU configurations, helping you select the right simulation hardware for your CaNS projects.

We benchmark a temporal boundary layer with stable stratification case, following the same scenario detailed in our tutorial.

Results

The benchmarks cover a range of cloud machines with different GPUs. The reference setup is the most affordable and smallest configuration, featuring 4 virtual CPUs (vCPUs) paired with a single NVIDIA L4 GPU. Other configurations that were tested include more powerful machines with increased CPU counts and higher-performance GPUs, such as the NVIDIA A100 and H100, which allow us to evaluate how scaling hardware resources affects simulation speed.

Below is a detailed comparison of execution times and speed-ups across different machine types:

| Machine Type | vCPUs | GPU Type | GPU Count | Execution Time | Speed-up | Estimated Cost (USD) |
|---|---|---|---|---|---|---|
| g2-standard-4 | 4 | NVIDIA L4 | 1 | 25h, 3 min | Reference | 6.86 |
| g2-standard-24 | 24 | NVIDIA L4 | 2 | 15h, 55 min | 1.57x | 10.75 |
| a2-highgpu-1 | 12 | NVIDIA A100 | 1 | 4h, 44 min | 5.29x | 7.38 |
| a2-highgpu-2 | 24 | NVIDIA A100 | 2 | 2h, 47 min | 9.00x | 8.85 |
| a3-highgpu-1 | 26 | NVIDIA H100 | 1 | 2h, 26 min | 10.29x | 6.52 |
| a3-highgpu-2 | 52 | NVIDIA H100 | 2 | 1h, 36 min | 15.65x | 8.64 |

Table 1: Benchmark results on Inductiva
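The speed-up column in Table 1 is simply the reference wall-clock time divided by each machine's wall-clock time. A minimal sketch of that calculation, with the times transcribed from Table 1:

```python
# Execution times from Table 1, in minutes.
times_min = {
    "g2-standard-4 (1x L4)":   25 * 60 + 3,   # reference configuration
    "g2-standard-24 (2x L4)":  15 * 60 + 55,
    "a2-highgpu-1 (1x A100)":   4 * 60 + 44,
    "a2-highgpu-2 (2x A100)":   2 * 60 + 47,
    "a3-highgpu-1 (1x H100)":   2 * 60 + 26,
    "a3-highgpu-2 (2x H100)":   1 * 60 + 36,
}

reference = times_min["g2-standard-4 (1x L4)"]

# Speed-up = reference wall-clock time / this machine's wall-clock time.
speedups = {name: reference / t for name, t in times_min.items()}

for name, s in speedups.items():
    print(f"{name}: {s:.2f}x")
```

Running this reproduces the speed-up column of Table 1 to within rounding (e.g. 9.00x for the dual-A100 machine).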

To further assess performance, we calculated the scaled time per cell, defined as the wall-clock time multiplied by the number of GPUs and divided by the total number of cells. Each time step in CaNS consists of three RK3 substeps, each of which solves a large Poisson equation. In this simulation mode of CaNS (with `is_impdiff_1d = T` set in the configuration file), the expected scaled time per cell on A100 GPUs is on the order of nanoseconds. These estimates are summarized below:

| Machine Type | GPU Type | GPU Count | Execution Time | Scaled Time per Cell (s) |
|---|---|---|---|---|
| g2-standard-4 | NVIDIA L4 | 1 | 25h, 3 min | 1.769e-08 |
| g2-standard-24 | NVIDIA L4 | 2 | 15h, 55 min | 2.247e-08 |
| a2-highgpu-1 | NVIDIA A100 | 1 | 4h, 44 min | 3.343e-09 |
| a2-highgpu-2 | NVIDIA A100 | 2 | 2h, 47 min | 3.929e-09 |
| a3-highgpu-1 | NVIDIA H100 | 1 | 2h, 26 min | 1.719e-09 |
| a3-highgpu-2 | NVIDIA H100 | 2 | 1h, 36 min | 2.259e-09 |

Table 2: Scaled time per cell estimates calculated by Pedro Costa
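As a sanity check on Table 2, the definition above can be inverted: wall-clock time times GPU count divided by the scaled time per cell should recover the same denominator for every row. A short sketch of that check, with all values transcribed from Table 2:

```python
# (machine, wall-clock time in seconds, GPU count, scaled time per cell in s),
# transcribed from Table 2.
rows = [
    ("g2-standard-4",  (25 * 60 + 3) * 60,  1, 1.769e-08),
    ("g2-standard-24", (15 * 60 + 55) * 60, 2, 2.247e-08),
    ("a2-highgpu-1",   (4 * 60 + 44) * 60,  1, 3.343e-09),
    ("a2-highgpu-2",   (2 * 60 + 47) * 60,  2, 3.929e-09),
    ("a3-highgpu-1",   (2 * 60 + 26) * 60,  1, 1.719e-09),
    ("a3-highgpu-2",   (1 * 60 + 36) * 60,  2, 2.259e-09),
]

# Invert the definition: scaled = wall * gpus / denominator
#                     => denominator = wall * gpus / scaled
denominators = [wall * gpus / scaled for _, wall, gpus, scaled in rows]

# If Table 2 is internally consistent, every row yields (nearly) the same value.
spread = max(denominators) / min(denominators)
print(f"denominator ~ {denominators[0]:.3e}, max/min spread: {spread:.4f}")
```

All six rows recover the same denominator to within about 0.1%, which is the consistency the Summary below refers to.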

Summary

The benchmark results clearly demonstrate the substantial performance benefits of using higher-end GPUs for CaNS simulations. The best-performing setup, equipped with two NVIDIA H100 GPUs, achieved a 15.7× speed-up over the baseline machine with a single NVIDIA L4 GPU, reducing execution time from 25 hours to just 1 hour and 36 minutes.

These gains, however, are subject to the limits of strong scaling: as GPU count increases while the problem size remains fixed, each GPU handles a smaller workload, leading to reduced occupancy and less-than-linear scaling. For example, doubling the H100 count cuts the execution time from 2h 26 min to 1h 36 min, a 1.52x gain rather than 2x, or roughly 76% parallel efficiency.

Comparing the measured execution times (Table 1) with the scaled time per cell per GPU (Table 2), we observe strong agreement across all machines. The anticipated performance on the order of a few nanoseconds per cell per GPU was achieved on the A100 and H100 machines, confirming the robustness and reliability of both the Inductiva platform and the CaNS solver.

With Inductiva, you’re able to seamlessly select the hardware that delivers the performance your simulations demand.