The Black SWAN: Debugging the Unexpected

Paulo Barbosa

Author

Maya Hershey

Editor

December 11, 2024

Tags:

SWAN, Debugging Challenges, MPI Machinefile, Cloud vs. Local Computing, Docker and Apptainer, SWANrun, Inductiva API

At Inductiva, we spend a lot of time making large-scale simulation workflows accessible, no matter where our users choose to run them. Whether it’s on the cloud, on local/on-prem setups, or, very soon, on massive HPC infrastructures, we want to make sure that every simulator we integrate runs flawlessly across all environments.

Achieving this level of consistency isn’t always smooth sailing. Each environment comes with its own quirks, from substantially different hardware configurations to unique system requirements. But sometimes the biggest challenges don’t stem from external factors; they come from the simulators themselves.

Take SWAN (Simulating WAves Nearshore), for example. It is a widely used coastal wave simulator that has become a cornerstone of coastal engineering, environmental studies, and maritime safety. Adapting SWAN to run on local machines was meant to be a straightforward task. After all, we had already ensured its smooth operation on the cloud, and since everything is containerized, this should not have been that difficult.

But SWAN had other plans: what started as a routine integration quickly spiraled into a black swan event of undocumented features and hidden complexities. Paulo, our engineer in charge of this mission, found himself in a three-plus-hour debugging session. Let us walk you through what happened, step by step.

SWAN, Containers, and a Determined Engineer

To make sure SWAN could run properly across all environments, we needed the right combination of tools, technologies, and sheer determination. Each part played a key role, and together, they would define the outcome. But before we dive in, let’s introduce some key terms and naming conventions that will come up throughout this post. When we refer to virtual machines from Google Cloud, we’ll use their standard naming scheme, like c2-standard-4.

The final number in the name (4 in this case) indicates the number of virtual CPUs (vCPUs) the machine has. A vCPU represents a processing unit in a virtualized environment, supported by physical cores in a real CPU. By default, Google Cloud enables CPU hyper-threading, meaning each physical core supports two vCPUs. For example:

  • A c2-standard-4 machine has 4 vCPUs, supported by 2 physical cores.
  • A c2-standard-16 machine has 16 vCPUs, supported by 8 physical cores.
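You can check these numbers yourself on any Linux machine with standard tools (nothing Inductiva-specific):

nproc                                                          # vCPUs/threads visible to the OS, e.g. 4
lscpu | grep -E "Thread\(s\) per core|Core\(s\) per socket"    # breakdown into threads per core and physical cores

On a c2-standard-4, nproc reports the 4 vCPUs, while lscpu shows how they map onto the underlying physical cores.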

With these terms in mind, let’s dive in.

Docker and Apptainer

First, we have Docker and Apptainer. These tools help us create isolated environments called containers, where applications can run smoothly no matter the system. Docker is widely used for building, sharing, and running containers, making it popular for software development. Apptainer (formerly Singularity) is often used in scientific computing because it’s better suited for high-performance computing (HPC) systems and is more secure for multi-user environments. Together, they package everything SWAN needs (code, libraries, and settings) so it works consistently across different setups.
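As a rough sketch of how the two fit together (the image names below are purely illustrative, not our actual images), a containerized SWAN can be built with Docker and then converted into an Apptainer image:

docker build -t swan:latest .                            # build the SWAN image from a Dockerfile
apptainer build swan.sif docker-daemon://swan:latest     # convert it into a .sif image for Apptainer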

MPI (Message Passing Interface)

Backing SWAN’s performance is MPI (Message Passing Interface). MPI speeds up simulations by splitting tasks into smaller pieces and running them at the same time. This “divide and conquer” approach significantly improves computational efficiency and speeds up processing, but it adds extra challenges when working across different environments.
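In its simplest form (a generic example, not SWAN-specific), launching a program over several MPI processes looks like this, where my_simulation stands in for any MPI-enabled binary:

mpirun -np 4 ./my_simulation    # run 4 cooperating MPI processes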

Paulo, The Determined Engineer

And finally, there’s Paulo, our determined engineer. Paulo is focused on one thing: creating the best possible experience for scientists and engineers who want to run numerical simulations at scale. He’s the one making sure simulators like SWAN run smoothly, whether on the cloud, on-prem machines, or local setups.

The SWAN Switch: From Errors to Insights

Once compiled, SWAN generates two executables: swan.exe and swanrun. When we learned from user feedback that swanrun was the preferred way to execute simulations over swan.exe, we knew it was time to make a change. 

Transitioning to swanrun in our latest update seemed straightforward at first, but it introduced some challenges, primarily a new requirement for something called a machinefile to enable MPI.

Imagine you’re running a simulation on one of our c2-standard-30 machines, which have 30 vCPUs. You prepare your SWAN input files and try to execute the simulation using all 30 available vCPUs with the command:

swanrun -input file.swn -mpi 30

But, instead of seeing your simulation take off, you see this:

***ERROR: no machinefile is present in current directory!

It’s not exactly the user-friendly experience we are looking for.

What’s a machinefile?

At its core, a machinefile is a simple text file that tells MPI which resources are available for your simulation. It lists the machines available for the computation and the number of processes (or “slots”) each machine can run (localhost slots=X). This configuration ensures SWAN can fully exploit its parallel computing capabilities and deliver faster simulation results. Even if you’re running SWAN on a single machine, MPI still needs this file to understand the configuration.

For example, on a c2-standard-30 machine, you’d need a machinefile containing the following line: localhost slots=30

This tells MPI that:

  1. The simulation will run on your local machine (localhost).
  2. The machine has 30 available processing slots, one per vCPU (slots=30).

Without this setup, SWAN can’t fully utilize parallel computing, and the process won’t proceed.
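Putting it together on a c2-standard-30, creating the machinefile and launching the run looks like this (assuming your input file is file.swn):

echo "localhost slots=30" > machinefile
swanrun -input file.swn -mpi 30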

When SWAN Refused to Play Nice

“It all started with a test run on a c2-standard-4 machine in the cloud,” said Paulo. Since this machine has 4 vCPUs, the respective machinefile was something like this: localhost slots=4

So, if the machine has 4 vCPUs, this setup should presumably have worked for a simulation requiring only 3 slots:

swanrun -input file.swn -mpi 3 

Instead, one of Paulo’s cloud simulations failed with a cryptic error message:

Not enough slots available.

“At first glance, the error didn’t make any sense. The machinefile clearly defined 4 slots, corresponding to the 4 vCPUs that the machine has, and the simulation only needed 3!”

To rule out any configuration issues, Paulo removed the machinefile entirely, expecting to see the standard error:

No machinefile detected.

Instead, Paulo saw the same message:

Not enough slots available.

“It was as if SWAN was mocking me, ignoring the very machinefile it had so desperately claimed to need before.”

To pinpoint the issue, Paulo decided to move to one of our local machines, equipped with an AMD Ryzen 7 7700X processor, which supports 16 threads. Since there is no virtualization layer in this setup, each of the 16 hardware threads is directly available to the simulation, providing a straightforward testing environment. Running the simulation manually in the SWAN Docker container, with no machinefile present, confirmed what he expected:

Error: No machinefile detected
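For reference, the local run looked roughly like this (the Docker image name is illustrative, not our actual image):

docker run --rm -v "$PWD":/work -w /work swan swanrun -input file.swn -mpi 3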

That seemed correct. The machinefile was really not there. So, was this a cloud-specific issue? To test, he switched back to a Google Cloud machine, using the exact same setup as on the local machine (the same machinefile, the same simulation command, and even the same Apptainer image):

apptainer run swan.sif swanrun -input file.swn -mpi 3

Yet again, the cloud simulation stubbornly threw the “Not enough slots available” error, as if the machinefile didn’t even exist.

“It was puzzling! On local machines, SWAN respected the machinefile and required its presence. On cloud machines, the simulation ignored the file altogether, yet still complained about unavailable slots.”

This was even more confusing given that the simulation was only asking for 3 slots; with 4 vCPUs on one machine and 16 threads on the other, both seemed more than capable of running it.

Paulo tested every variable: Docker vs. Apptainer, local vs. cloud environments, MPI versions; he even combed through the machinefile character by character in search of a typo. But nothing explained why SWAN behaved so differently in environments that were, on paper, identical.

Cracking the SWAN Code

After countless hours of testing, debugging, and some serious head-scratching, we finally had a breakthrough. One suggestion from our teammate Luis Cunha shifted the perspective:

“What if swanrun isn’t actually a binary? What if it’s just a shell script?”

As it turns out, that’s exactly what it was: swanrun was a shell script.

Digging into the script revealed the root cause of all our headaches: swanrun’s behavior is determined by the number of vCPUs/threads the machine supports.

Here’s the catch:

  • Machines with more than 8 vCPUs/threads (like our local AMD machine with 16 threads) require a machinefile. If the file is missing, swanrun throws an error, a behavior that seems reasonable for managing resources.
  • Machines with 8 or fewer vCPUs/threads (like the cloud c2-standard-4 machine) completely ignore the machinefile. Instead, MPI automatically determines the available resources based on the number of physical cores, not the vCPUs/threads. On the c2-standard-4 machine, which has 4 vCPUs supported by 2 physical cores, this results in only 2 slots being available for parallelization. When we asked MPI to run on 3 slots, it failed with the error: Not enough slots available.
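With that in mind, the behavior boils down to something like the following simplified sketch (our reconstruction of the logic, not the actual swanrun script; variable names are illustrative):

#!/bin/sh
NPROCS=$1                # number of MPI processes requested (the -mpi value)
NTHREADS=$(nproc)        # vCPUs/threads visible on the machine

if [ "$NTHREADS" -gt 8 ]; then
    # more than 8 threads: a machinefile is mandatory
    [ -f machinefile ] || { echo "***ERROR: no machinefile is present in current directory!"; exit 1; }
    mpirun -np "$NPROCS" -machinefile machinefile swan.exe
else
    # 8 or fewer threads: the machinefile is ignored and MPI falls back to
    # its own default slot count (physical cores, not vCPUs/threads)
    mpirun -np "$NPROCS" swan.exe
fi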

Interestingly, when we ran the same simulation on the c2-standard-4 with 2 slots, everything worked perfectly. Even though the machinefile was ignored, the simulation executed as expected because MPI correctly identified the 2 physical cores available for processing.
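In other words, once the request matched the 2 physical cores, the very same command succeeded:

swanrun -input file.swn -mpi 2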

This quirk explained everything. The fact that we had been testing on “small” cloud machines with just 4 vCPUs (2 physical cores) had put us in a weird corner case unrelated to our setup. It also confirmed that our switch from swan.exe to swanrun didn’t introduce any bugs; the behavior we observed was exactly what SWAN was designed to do.

We walked away from this with a deeper understanding of SWAN, a solution for our users, and knowledge that will undoubtedly help us tackle future challenges with other simulators.

What’s Next?

This story isn’t just about debugging; it’s about the challenges and triumphs that come with making cutting-edge simulators like SWAN accessible to all. It’s part of a larger journey, and we’re excited to share more developer stories like this in the future.

If you enjoyed this, you might also like our FVCOM compiling handbook, where we tackled similar challenges with another popular simulator.

Stay tuned for updates, and explore our other built-in simulators, such as AMR-Wind, CaNS, DualSPHysics, FDS, GROMACS, NWChem, OpenFOAM, OpenFAST, Reef3D, SCHISM, SPlisHSPlasH, SWAN, SWASH, and XBeach. If you’ve encountered any further issues with SWAN, feel free to reach out.

Register today and enjoy $5 USD in free credits to explore our API’s full features, run your favorite simulators on the cloud, and access compute resources!
