Allocating Computational Resources in a Diverse Chip Ecosystem

Luís Sarmento

Author

Maya Hershey

Editor

December 18, 2024

Tags:

Cloud machine allocation; Benchmarking tools for cloud computing; Cloud performance vs. cost analysis; Allocation problem in cloud computing explained; High-performance computing on AWS, Google Cloud, and Azure

In the previous blog post, I discussed how the future of computing is set to become far more diverse than the one we know today, which is dominated by processing technologies from Intel, AMD, and NVIDIA. At the forefront of this transformation are the hyperscalers (Google, Amazon, and Microsoft), who are now bringing their own homebrewed chips to market, both for AI workloads (AI accelerators) and general processing (regular CPUs).

This push by the hyperscalers is just the beginning. Their efforts are poised to dramatically expand the variety of processing options available for running compute jobs, and this trend is only speeding up. Beyond the hyperscalers, other notable players are contributing to this trend. For example, Oracle and Tencent now offer powerful Arm-based chips from Ampere in their cloud portfolios, further diversifying the market.

In this post, we’ll look at how this growing diversity impacts engineers and scientists. We’ll break down key factors like machine families and generations, and show how benchmarking tools can help us navigate this new sea of options.

The Allocation Problem

At first, this growing diversity in hardware options seems like a win for scientists and engineers running large computational jobs, especially on the cloud. More choices should be a good thing, right? With all these new hardware configurations, there’s got to be something out there for every workload, offering optimized performance at a price that makes sense.

But then again, too much of a good thing can become a problem. With so many options, how do you decide? Should you run your job on a cutting-edge AWS machine with a homebrewed Arm-based processor, or go for an older but much cheaper Intel-based machine on Google Cloud? These options could have completely different cost-to-performance ratios. So how do you make sure you’re not making a costly mistake by choosing an inefficient option?

Finding the right machine for a specific computational job that meets your time requirements and stays within your budget is no easy task. This challenge is what we refer to as the Allocation Problem. As the ecosystem grows more diverse, the problem gets harder, especially because performance varies from one compute workload to another.

The Allocation Problem is figuring out the best hardware for your workload while balancing performance and cost.

Breaking Down the Allocation Problem: 4 Factors You Need to Know

Let’s take a moment to consider the key factors that should be taken into account when selecting a cloud machine and how these factors can impact both the performance and cost of running a computational job.

1- Cloud Providers and Datacenter Locations

The first step in tackling the Allocation Problem is selecting a cloud provider. There are plenty of options out there, but the big three—AWS, Azure, and Google Cloud—cover about two-thirds of the market. These providers offer a broad constellation of tools beyond just compute power, which can be a deciding factor for some users.

Now, here’s where it gets tricky. Prices for what seem like equivalent compute options can vary a lot, not just between cloud providers but even between datacenters from the same provider. This can happen for a number of reasons, but it often comes down to the cost of electricity or other location-specific operational expenses.

So, to make the best choice, it’s not just about picking the “right” provider or machine type. You also have to think about where that machine is located. Sounds like a lot already, right? But honestly, this is just the warm-up. The real complexity kicks in when we start looking at machine-specific factors. Let’s dive into those next!

2- Machine Families

Cloud providers usually organize their machines into “families,” each tailored to specific types of compute workloads. These families are designed and optimized for particular tasks, and over time, they’ve become highly specialized.

The most common machine families are CPU-based, built for handling general workloads. Within this group, you’ll find options ranging from machines optimized for web servers or database systems to those designed for high-performance computing, equipped with blazing-fast processors and ultra-high-bandwidth memory. Traditionally, these CPU-based families use processors from AMD (like the AMD EPYC 9005) or Intel (such as the 5th Generation Xeon). But lately, the big three cloud providers have shaken things up by introducing general-purpose machines powered by their own custom-designed CPUs.

Then there are the machine families optimized for AI. These range from budget-friendly options with a single GPU to powerhouse “AI supercomputers” loaded with multiple high-end NVIDIA chips. And just like with CPUs, the big three cloud providers also have exclusive AI-focused families equipped with their own specialized accelerator chips, designed to handle demanding AI workloads.

Generally, within each cloud provider, these machine families are pretty straightforward and can help you match your workload to the right type of machine. But here’s the thing: the most obvious choice isn’t always the best one. Sometimes, a less obvious family might actually fit your specific workload better, especially when you factor in your budget and time constraints.

Machine families offer tailored options for specific workloads, from general-purpose CPUs to specialized AI supercomputers, but sometimes a less obvious option might better fit your workload, budget, and time constraints.

Why? Because there are two other key factors to consider: virtualization within the same family and the generation of the family.

3- Machine Virtualization

Instead of giving you direct access to physical hardware (known as “bare metal”), most cloud providers offer Virtual Machines (VMs). These VMs run on top of actual hardware and provide a set number of virtual CPUs (vCPUs) paired with a specific amount of RAM.

VMs are a smart choice for both cloud providers and users. They offer greater flexibility, allowing providers to maximize resource usage while giving users cost-effective and scalable options for their workloads. For most scenarios, VMs strike the right balance between performance and practicality.

To make things easier for users, cloud providers offer a range of “pre-packaged” VM configurations for each machine family. These configurations let users pick machines that fit their specific needs, whether they require more or fewer vCPUs or different amounts of RAM per vCPU. This flexibility helps users choose the right size machine to match their workload and current requirements.

So, within the same family—let’s say a high-end one optimized for High-Performance Computing with the latest processors and high-bandwidth memory—a user has plenty of flexibility. For example, they might start with a cheaper pre-packaged VM with 16 vCPUs and 2GB of RAM per vCPU to run initial tests on a simulation. Then, when it’s time to run the full simulation, they could scale up to a massive VM with 360 vCPUs and 8GB of RAM per vCPU—nearly 3TB of RAM!

These pre-packaged VMs come with a fixed amount of RAM per vCPU, which determines their variant. Common levels include 1GB, 2GB, 4GB, and 8GB of RAM per vCPU. For instance, VMs with 8GB of RAM per vCPU are often labeled as “highmem” variants, while those with 2GB or 4GB per vCPU are typically part of the “standard” variant. This setup helps users easily pick the right balance of memory and compute power for their workloads.

Naturally, more powerful machines come with higher costs. In most cases, the price scales linearly with the number of vCPUs—double the vCPUs, and you’ll double the cost. Adding more RAM to the VM also increases the price. For example, “highmem” variants, which offer more RAM per vCPU, are typically 15% to 30% more expensive than “standard” variants with the same number of vCPUs.
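
To make these pricing rules concrete, here is a minimal sketch of a price model. The per-vCPU price and the 20% “highmem” premium are made-up values (the premium simply sits inside the 15% to 30% range mentioned above), and the function names are hypothetical, not part of any provider's API.

```python
# Hypothetical prices, chosen only to illustrate the scaling rules in the text.
PRICE_PER_VCPU_HOUR = 0.05   # assumed "standard" price, USD per vCPU per hour
HIGHMEM_PREMIUM = 0.20       # assumed premium, inside the 15%-30% range above


def hourly_price(vcpus: int, highmem: bool = False) -> float:
    """Return the assumed hourly price of a VM in USD."""
    price = vcpus * PRICE_PER_VCPU_HOUR      # linear in the number of vCPUs
    if highmem:
        price *= 1 + HIGHMEM_PREMIUM         # extra RAM per vCPU costs more
    return price


def total_ram_gb(vcpus: int, ram_per_vcpu_gb: int) -> int:
    """Total RAM of a pre-packaged VM with a fixed RAM-per-vCPU ratio."""
    return vcpus * ram_per_vcpu_gb


for vcpus in (16, 32, 360):
    standard = hourly_price(vcpus)
    highmem = hourly_price(vcpus, highmem=True)
    print(f"{vcpus:>3} vCPUs: standard ${standard:.2f}/h, highmem ${highmem:.2f}/h")

# The large VM from the example: 360 vCPUs with 8GB each.
print(total_ram_gb(360, 8), "GB of RAM")
```

Real price lists are not perfectly linear, so treat this as a first approximation to be replaced with the provider's published prices.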

Now, just because you run the same job on a VM with twice as many vCPUs doesn’t mean it will run twice as fast—but your costs will definitely double. In fact, the performance boost might be minimal for various reasons, while the cost increase could be significant. Similarly, using a VM with double the RAM might speed up your job in some cases—or it might not—but it will always come with a higher price tag.
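
One common way to reason about this is Amdahl's law, which limits the speedup of a job by the fraction of it that cannot run in parallel. The sketch below is an illustration under that assumption, not a model of any particular simulator; the 90% parallel fraction, the 100-hour single-vCPU runtime, and the price are all invented numbers.

```python
def amdahl_speedup(vcpus: int, parallel_fraction: float) -> float:
    """Speedup over a single vCPU predicted by Amdahl's law."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / vcpus)


def runtime_and_cost(vcpus, parallel_fraction, single_vcpu_hours, price_per_vcpu_hour):
    """Return (runtime in hours, total cost in USD) under the assumed model."""
    runtime = single_vcpu_hours / amdahl_speedup(vcpus, parallel_fraction)
    cost = runtime * vcpus * price_per_vcpu_hour  # price is linear in vCPUs
    return runtime, cost


# Assumed workload: 90% parallelizable, 100 hours on one vCPU, $0.05 per vCPU-hour.
small = runtime_and_cost(16, 0.9, 100.0, 0.05)
large = runtime_and_cost(32, 0.9, 100.0, 0.05)

print(f"16 vCPUs: {small[0]:.1f} h, ${small[1]:.2f}")
print(f"32 vCPUs: {large[0]:.1f} h, ${large[1]:.2f}")
```

With these made-up numbers, doubling the vCPUs shortens the run by about 18% while the bill grows by about 64%. A workload with a larger parallel fraction would fare better, which is exactly why the answer depends on the job.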

Choosing the right VM means balancing performance and cost, keeping in mind that more power doesn’t always mean better efficiency for every workload.

So, how do you know if it’s worth upgrading to more powerful VMs?

That’s a great question.

But hold on—we’re not done yet. There are still more factors to consider.

4- Machine Generation

Another important factor that can have a big impact on the cost and performance of your computational job is the generation of the machine—essentially, how old it is. It’s no surprise that newer machines tend to be faster and more efficient, but they’re also more expensive to purchase and maintain, especially at scale. The age of the machine plays a big role in balancing performance and cost.

Let’s take a closer look at what happens when cloud providers introduce a new machine family, like one optimized for high-performance computing. To keep things simple, let’s focus on the big three cloud providers—Google, AWS, and Microsoft—who manage their own datacenters and handle all the hardware directly.

The cloud provider first needs to decide on the optimal machine configuration. This includes selecting the type of CPU, GPU, and motherboard, determining the amount of RAM and storage, and choosing the best I/O connections, among other design choices. Once they settle on what they consider the ideal setup, they make a massive investment, often in the tens of millions of dollars, to fill their datacenters with thousands of these machines.

Once these top-of-the-line machines are installed in the datacenters, they’re there for the long haul. Investments of this scale take time to pay off—often five years or more—before the costs are fully recovered.

In the meantime, technology doesn’t wait around. Within a year or two, newer and faster processors, improved motherboards, and cheaper RAM become available. At that point, the cloud provider decides it’s time to update the same family of high-performance computing machines for its users. They go through the purchasing process again, investing millions of dollars to bring another generation of the same family to life. Once more, thousands of new compute nodes are deployed across multiple datacenters.

Now, two generations of the same machine family will coexist, both available for users to choose from. In fact, it’s pretty common to see three generations of the same family running side by side. For example, Google Cloud offers three generations of its Intel-based “compute optimized” machines: the c2, c3, and c4 families.

Cloud providers continually invest in newer machine generations, creating a cycle where multiple generations of the same family coexist. Newer machines offer better performance but come at a higher cost, requiring careful evaluation.

Newer machines are generally faster and more powerful than older ones, but they also come with a higher price tag. For a given computational job, is it actually worth paying the premium to use the faster machines? Does, say, a 30% higher price lead to a more than proportional reduction in execution time, making the computation not only faster but also cheaper in the end?
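
The break-even point for that question follows from simple arithmetic: total cost is hourly price times runtime, so a machine that costs 30% more per hour is cheaper overall only if it cuts the runtime by more than 1 - 1/1.3, or roughly 23%. A small sketch, with hypothetical function names:

```python
def breakeven_runtime_reduction(price_increase: float) -> float:
    """Fraction by which runtime must drop for total cost to stay the same.

    price_increase is relative, e.g. 0.30 for a 30% higher hourly price.
    """
    return 1.0 - 1.0 / (1.0 + price_increase)


def is_cheaper_on_newer(price_increase: float, runtime_reduction: float) -> bool:
    """True if the pricier machine yields a lower total cost for the job."""
    relative_cost = (1.0 + price_increase) * (1.0 - runtime_reduction)
    return relative_cost < 1.0


print(f"Break-even runtime reduction: {breakeven_runtime_reduction(0.30):.1%}")
print(is_cheaper_on_newer(0.30, 0.25))  # 25% faster: worth it
print(is_cheaper_on_newer(0.30, 0.20))  # 20% faster: not worth it on cost alone
```

Whether a given job actually clears that threshold is something only a measurement can tell you.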

Where Does This Leave Us With the Allocation Problem?

The sheer number of possibilities to consider when choosing the best option for your specific job is staggering. It boils down to:

# cloud providers × # datacenters × # families × # generations × # VM options

That’s a massive number of combinations. Even if you narrow it down by sticking to a specific cloud provider and datacenter location—say, the one closest to you—you’re still left with:

# families × # generations × # VM options

And that alone can mean sorting through hundreds of options.
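
To get a feel for how quickly this product grows, here is a small enumeration for a single provider and datacenter. The family names, generation count, and vCPU sizes are invented for illustration and do not correspond to any real catalog.

```python
from itertools import product

# Hypothetical catalog for one provider in one datacenter.
families = ["general-purpose", "compute-optimized", "memory-optimized", "hpc"]
generations = [1, 2, 3]
vcpu_sizes = [2, 4, 8, 16, 32, 64, 96, 180, 360]
ram_variants = ["standard", "highmem"]

# Every (family, generation, vCPU count, RAM variant) combination is one VM option.
options = list(product(families, generations, vcpu_sizes, ram_variants))

print(len(options))   # 4 * 3 * 9 * 2 = 216
print(options[0])
```

Even this small made-up catalog yields a couple of hundred candidates before adding other providers or locations.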

So how do you find the best machine for your computational job, one that meets your time requirements and minimizes costs? This question becomes even more critical when dealing with large-scale workloads, like multi-day simulations or running thousands of simulations at once.

For example, let’s say you’re training AI models to solve specific Computational Fluid Dynamics (CFD) scenarios, like simulating a wind tunnel. To do this, you need to generate a dataset by running around 20,000 wind tunnel simulations using a traditional CFD simulator like OpenFOAM. That’s a huge computational workload, and it can get very expensive if you end up choosing the wrong machines.

So, what’s the best VM option for running those 20,000 simulations in a reasonable time and at the lowest possible cost?
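
A back-of-the-envelope estimate shows how much is at stake. The sketch below assumes each simulation runs on its own VM, one after another, with the batch spread evenly over a pool of identical machines. The per-simulation runtimes, prices, and pool size are hypothetical; only the 20,000 figure comes from the example.

```python
import math

N_SIMULATIONS = 20_000  # size of the dataset in the example above


def batch_estimate(minutes_per_sim: float, price_per_hour: float, machines: int):
    """Return (total cost in USD, wall-clock hours) for the whole batch."""
    total_machine_hours = N_SIMULATIONS * minutes_per_sim / 60
    cost = total_machine_hours * price_per_hour
    # Each machine works through its share of simulations sequentially.
    sims_per_machine = math.ceil(N_SIMULATIONS / machines)
    wall_clock_hours = sims_per_machine * minutes_per_sim / 60
    return cost, wall_clock_hours


# Hypothetical options: an older, cheaper VM and a newer one that costs 30% more.
older = batch_estimate(minutes_per_sim=30, price_per_hour=1.00, machines=100)
newer = batch_estimate(minutes_per_sim=20, price_per_hour=1.30, machines=100)

print(f"older: ${older[0]:,.0f} over {older[1]:.0f} h")
print(f"newer: ${newer[0]:,.0f} over {newer[1]:.0f} h")
```

In this invented scenario the pricier machine wins on both time and cost, but a slightly smaller speedup would flip the result, so the per-simulation runtime has to be measured rather than guessed.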

The short answer: with so many factors at play, you can’t really know until you try!

Benchmarking for the Goldilocks Machine

In our previous blog post, we talked about how the growing diversity of computing options is transforming the landscape, and how it’s not always easy to find the best machine for your job. That’s exactly why we’re so excited about release v0.12 of our API! With its powerful new benchmarking tools, we’re making it easier than ever to take the guesswork out of machine selection. Our team has been working hard to bring you a smarter, more efficient way to tackle the Allocation Problem.

Here’s the general idea behind the new benchmarking tools we’re bringing to you. Imagine you need to run a massive simulation job. This could be a single, large-scale simulation requiring thousands of core hours of computation. Or, as in the dataset generation example we mentioned earlier, it could involve running thousands of smaller simulations that collectively add up to a significant amount of compute time.

Our new benchmarking functionality is designed to make this easier. It allows you to automatically run a sample simulation across the many VMs you want to evaluate—potentially hundreds of options. With these tools, you can easily submit your sample simulation in parallel to VMs from different families, generations, and configurations of vCPUs and RAM, all through a simple Python script.

Breaking Down the Benchmarking Process

For example, let’s say you’re modeling a coastal engineering scenario where you need to simulate six hours of waves hitting the shore under specific sea and wind conditions. Based on previous experience, you estimate that the job could take anywhere from 3 to 5 days on a good machine with a few hundred vCPUs.

But you’re unsure if your problem can truly take advantage of the high number of vCPUs offered by your cloud provider or if the newer, faster machines on the list will provide any real benefit. You’re also uncertain whether paying extra for machines with additional RAM is worth it since your simulation may or may not benefit from the added memory, depending on the number of vCPUs you use.

That’s a lot of “I don’t knows” and tricky interdependencies to navigate. But with our benchmarking tools, you can test all those options at once and get the answers you need.

The first step is to adjust the parameters of your simulation to create a shorter sample version. For example, instead of simulating the full six hours of wave activity, you could simulate just 2 minutes. This drastically reduces the runtime: the sample completes in minutes rather than the 3 to 5 days required for the full simulation.
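
To turn a sample measurement back into an estimate for the full job, one simple approach is to assume that wall-clock time grows linearly with simulated time. That assumption ignores startup overhead and any change in behavior as the simulation evolves, so take it as a rough guide. The function name and the 30-minute measurement below are hypothetical.

```python
FULL_SIMULATED_SECONDS = 6 * 3600    # six hours of wave activity
SAMPLE_SIMULATED_SECONDS = 2 * 60    # the two-minute sample


def extrapolate_full_runtime_hours(sample_wall_clock_minutes: float) -> float:
    """Estimate the full job's wall-clock hours from a sample run.

    Assumes runtime scales linearly with the simulated time span.
    """
    scale = FULL_SIMULATED_SECONDS / SAMPLE_SIMULATED_SECONDS  # 180x
    return sample_wall_clock_minutes * scale / 60


# Hypothetical measurement: the sample took 30 minutes on some VM.
hours = extrapolate_full_runtime_hours(30)
print(f"Estimated full run: {hours:.0f} h ({hours / 24:.2f} days)")
```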

Once you have your sample ready, you can use the Inductiva API to run it in parallel across dozens of different machine configurations. This lets you test all the possible options quickly and efficiently, saving you time and helping you find the best fit for your workload.

In the end, you’ll get a detailed report showing the performance and cost metrics for all the test runs. With this information, you can confidently decide which VMs are the best fit for your full 3-to-5-day job. No more guessing: you’ll get the performance facts before committing to the full workload.
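
Once you have such a report, picking a machine can be as simple as the sketch below. It does not use the Inductiva API. It only post-processes a hand-written table of hypothetical results, extrapolating each sample to the full job with an assumed 180x linear scale (six hours versus two minutes) and choosing the cheapest option that finishes within an assumed 5-day limit. Machine labels, runtimes, and prices are all invented.

```python
from dataclasses import dataclass

SCALE = 180               # assumed: full run = 180x the sample (6 h vs 2 min)
DEADLINE_HOURS = 5 * 24   # assumed time limit for the full job


@dataclass
class BenchmarkRow:
    machine: str           # hypothetical label, not a real VM name
    sample_minutes: float  # measured wall-clock time of the sample run
    price_per_hour: float  # hourly price in USD


def full_job_estimate(row: BenchmarkRow):
    """Return (estimated hours, estimated cost in USD) for the full job."""
    hours = row.sample_minutes * SCALE / 60
    return hours, hours * row.price_per_hour


def best_machine(rows, deadline_hours=DEADLINE_HOURS):
    """Cheapest machine whose estimated runtime meets the deadline, or None."""
    feasible = [r for r in rows if full_job_estimate(r)[0] <= deadline_hours]
    if not feasible:
        return None
    return min(feasible, key=lambda r: full_job_estimate(r)[1])


report = [
    BenchmarkRow("older-gen-180-standard", 45, 6.00),
    BenchmarkRow("newer-gen-180-standard", 32, 7.80),
    BenchmarkRow("newer-gen-180-highmem", 31, 9.40),
    BenchmarkRow("newer-gen-360-standard", 26, 15.60),
]

choice = best_machine(report)
hours, cost = full_job_estimate(choice)
print(f"{choice.machine}: about {hours:.0f} h for about ${cost:,.2f}")
```

In this invented report the older machine is too slow for the deadline, extra RAM barely helps, and doubling the vCPUs costs far more than the time it saves, so the newer standard VM comes out ahead.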

Over time, we hope our benchmarking tool will help you gain a deeper understanding of the cost-to-performance ratio across the various compute options available on different cloud providers. Ultimately, it’s designed to help you save both money and time.

If this sounds right up your alley, stay tuned—version 0.12 of the API is coming very soon!
