Spot Machines

Spot Machines (also known as preemptible VMs on GCP) are unused cloud machines offered at a significant discount, often 60-90% below the price of standard on-demand machines. They are an effective way to reduce simulation costs, with the trade-off that the cloud provider can reclaim them at any time. This guide explains how they work and how to use them effectively.

|             | On-Demand Machines | Spot Machines |
| Cost        | Standard, fixed price | Heavily discounted (60-90% off) |
| Reliability | High (guaranteed availability) | Lower (can be preempted) |
| Best For    | Time-critical tasks, and workloads that cannot be interrupted. | Batch processing, fault-tolerant jobs, and cost-sensitive, non-urgent tasks. |
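
To make the discount range concrete, here is a minimal sketch of the arithmetic. The hourly price is a hypothetical figure chosen for illustration, not a quoted rate, and spot_cost_range is a helper defined here, not part of the Inductiva API.

```python
def spot_cost_range(on_demand_hourly, hours, min_discount=0.60, max_discount=0.90):
    """Return (on_demand_cost, lowest_spot_cost, highest_spot_cost).

    The 60-90% discount range comes from this guide; actual discounts vary.
    """
    on_demand_cost = on_demand_hourly * hours
    lowest_spot_cost = on_demand_cost * (1 - max_discount)
    highest_spot_cost = on_demand_cost * (1 - min_discount)
    return on_demand_cost, lowest_spot_cost, highest_spot_cost


# Hypothetical price: 1.50 USD/hour is illustrative only.
on_demand, low, high = spot_cost_range(on_demand_hourly=1.50, hours=10)
print(f"On-demand: {on_demand:.2f} USD")
print(f"Spot:      {low:.2f} to {high:.2f} USD")
```

This estimate ignores any time lost to preemption, which is covered later in this guide.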

How to Use Spot Machines

You can request Spot Machines for any resource type (MachineGroup, ElasticMachineGroup, or MPICluster) by setting spot=True during initialization. To have interrupted tasks resubmitted automatically, also set resubmit_on_preemption=True in the simulator's run() method.

import inductiva

# Request a Spot Machine by setting spot=True
spot_machine = inductiva.resources.MachineGroup(
    machine_type="c2-standard-30", 
    spot=True
)

Note: If spot=True is not set, Inductiva will launch a standard on-demand machine.

Understanding Preemption

The main drawback of Spot Machines is that the cloud provider can reclaim, or preempt, them at any time. Inductiva provides tools to manage this risk.

What Happens When a Machine is Preempted

When a Spot Machine running a task is preempted, the following occurs:

  1. Task Interruption: The task running on the machine is immediately stopped. Its status changes to Spot Reclaimed.
  2. Other Machines are Unaffected: If the preempted machine was part of a larger MachineGroup, the other machines in the group are not affected and will continue running.

How to Automate Recovery

Instead of manually resubmitting the interrupted task, you can instruct the API to automatically handle interruptions for you.

To enable this, set resubmit_on_preemption=True in your simulator's run() method. When this flag is active, Inductiva will:

  1. Detect the interruption.
  2. Reschedule the simulation.
  3. Relaunch the task on a new machine.
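
The three steps above can be pictured with a toy model. This is a sketch for intuition only, not Inductiva's implementation: run_task, the per-hour preemption probability, the "Success" label, and the max_attempts cap are all assumptions made here, and the model assumes a preempted task restarts from scratch.

```python
import random


def run_task(work_hours, preempt_prob, resubmit_on_preemption, rng, max_attempts=100):
    """Toy model of a task on a spot machine.

    Each hour the machine is preempted with probability preempt_prob.
    Returns (final_status, attempts, total_hours_used).
    """
    total_hours = 0
    for attempt in range(1, max_attempts + 1):
        preempted = False
        for _ in range(work_hours):
            total_hours += 1
            if rng.random() < preempt_prob:
                preempted = True  # 1. The interruption is detected.
                break
        if not preempted:
            return "Success", attempt, total_hours
        if not resubmit_on_preemption:
            return "Spot Reclaimed", attempt, total_hours
        # 2. and 3. Reschedule and relaunch: the next loop iteration
        # stands in for a fresh machine running the task from the start.
    return "Spot Reclaimed", max_attempts, total_hours


rng = random.Random(42)  # Seeded so repeated runs give the same result.
print(run_task(work_hours=5, preempt_prob=0.1, resubmit_on_preemption=True, rng=rng))
```

With resubmission disabled, the same model stops at the first preemption, which mirrors the behavior of a task left in the Spot Reclaimed state.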

This provides the best of both worlds: the low cost of Spot Machines combined with the resilience of automated recovery.

Example: Resilient Simulation on a Spot Machine

The following example shows how to correctly launch a fault-tolerant simulation. We request a Spot Machine for cost savings and enable automatic resubmission for reliability.

import inductiva

# 1. Request a Spot Machine for a lower cost.
machine_group = inductiva.resources.MachineGroup(
    machine_type="c2-standard-30", 
    spot=True
)
machine_group.start()

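# 2. Configure the simulation to automatically restart if preempted.
swash = inductiva.simulators.SWASH()
task = swash.run(
    # Simulation inputs (input directory, config file) are omitted here for brevity.
    on=machine_group,
    resubmit_on_preemption=True  # Enable fault tolerance
)

# 3. Print the task's current status.
print(task.get_status())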

Note: If spot=True is set without resubmit_on_preemption=True, Inductiva will not relaunch your task if the machine is preempted; you will need to resubmit it manually.

When to Use Spot Machines

✅ Use Spot Machines for:

  • Large-scale batch processing.
  • Simulations that are fault-tolerant.
  • Non-urgent research and development tasks where cost is a concern.
  • Benchmarking.

❌ Avoid Spot Machines for:

  • Time-critical simulations.
  • Short, single-run tasks where the potential delay from a preemption would be significant.
  • Simulations that cannot be easily or cleanly restarted.
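
To judge whether the delay risk in the list above is acceptable for a given job, a rough estimate can help. The sketch below assumes preemptions arrive at a constant average rate (a simplifying assumption, not a documented property of any cloud provider), that a preempted task restarts from scratch, and that restarting adds no overhead. Under those assumptions, a job needing T uninterrupted hours with λ preemptions per hour takes (e^(λT) - 1) / λ hours on average. The rate used is hypothetical.

```python
import math


def expected_hours_with_restarts(job_hours, preemptions_per_hour):
    """Expected wall-clock hours to finish a job that restarts from scratch.

    Assumes preemptions follow a Poisson process with a constant rate
    and that relaunching costs no extra time.
    """
    if preemptions_per_hour == 0:
        return float(job_hours)
    return math.expm1(preemptions_per_hour * job_hours) / preemptions_per_hour


# Hypothetical rate of 0.05 preemptions per hour, for illustration only.
for job_hours in (1, 10, 40):
    expected = expected_hours_with_restarts(job_hours, preemptions_per_hour=0.05)
    print(f"{job_hours:>2} h job: about {expected:.1f} h expected on a spot machine")
```

In this model the overhead grows quickly with job length, which is one reason long simulations that cannot be cleanly restarted are poor candidates for Spot Machines.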

In summary:

| Pros | Cons |
| Significant Cost Savings: Get access to computing power at a fraction of the on-demand price, often with discounts of 60-90%. | Preemption Risk: The cloud provider can reclaim the machine at any time, interrupting your simulation. |
| Ideal for Flexible Workloads: Excellent for tasks that are not time-critical or that can tolerate interruptions. | Potential Delays: If a machine is preempted, the task must be restarted, which can lead to longer overall completion times. |