GCP Spot VMs Explained: Preemptible vs Spot, Use Cases, and Risks
A Spot VM is a standard Compute Engine VM that runs on spare capacity Google has available at any given moment. In exchange for a 60–91 percent discount, you accept that GCP can stop the VM with 30 seconds' notice when it needs that capacity back. For batch jobs, CI runners, and rendering pipelines, the trade-off is worth it. For databases and customer-facing APIs, it is not.
Simple explanation
Google’s data centres always have some compute capacity sitting idle: servers that are not fully booked by other customers at that moment. Spot VMs let you rent that idle capacity at a large discount.
The catch is that the capacity is not reserved for you. If demand for on-demand VMs rises and Google needs the underlying hardware, your Spot VM gets a 30-second termination notice and then stops. You have no say in when that happens.
The rule of thumb: if your workload can be paused and restarted from a checkpoint without losing meaningful progress, Spot VMs are worth considering. If any interruption causes data loss, service downtime, or a failed customer request, they are the wrong choice.
A Spot VM is like a standby airline seat. You get a heavily discounted fare, but if a full-price passenger needs your seat, you get bumped with short notice. If your journey can be paused and resumed later, the discount is worth it. If you must arrive on time for a meeting, pay full price.
Preemptible vs Spot VMs
Spot VMs are the modern replacement for preemptible VMs. Both use spare capacity and both can be interrupted at any time. The practical differences are small but matter for long-running workloads.
| Feature | Preemptible | Spot |
|---|---|---|
| Maximum runtime | 24 hours | No limit |
| Termination notice | 30 seconds | 30 seconds |
| Can be interrupted by GCP | Yes | Yes |
| Pricing | Fixed discount rate | Fixed discount rate |
| Recommended for new workloads | No, being phased out | Yes |
In practice, the only meaningful difference is the 24-hour cap on preemptible VMs. A job that runs longer than 24 hours will be forcibly stopped by GCP even if no capacity pressure exists. That is simply how preemptible VMs work. Spot VMs have no such constraint, which makes them the correct choice for long-running batch workloads.
If you see an older tutorial using the --preemptible flag,
the modern equivalent is --provisioning-model=SPOT. The two
behave similarly, but Spot VMs have no 24-hour runtime cap and are the
type Google recommends for all new workloads.
How Spot VMs work
When you request a Spot VM, GCP checks whether spare capacity exists in the requested zone for the machine type you chose. If it does, the VM starts immediately. If not, the request fails straight away. GCP does not queue Spot VM requests.
While the VM is running, GCP continuously monitors demand across its fleet. If on-demand demand rises and spare capacity needs to be reclaimed, GCP selects Spot VMs for termination. Your VM receives a 30-second notice via the metadata server and is then either stopped or deleted, depending on the termination action you configured when creating it.
The 30-second window is the only opportunity your application has to react. Anything that cannot complete in that time (a large upload, a complex transaction) will be cut short. This is why architecture matters more than raw compute price: a Spot VM running an application with no interruption handling is not cheaper, it is simply less reliable.
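One way to respect that limit is to give every cleanup step a rough duration estimate and skip any step that no longer fits in the remaining time. The sketch below assumes such estimates exist; `run_shutdown_steps`, the step names, and the estimates are all illustrative and not part of any GCP library.

```python
import time


def run_shutdown_steps(steps, budget_seconds=30.0, clock=time.monotonic):
    """Run cleanup steps in order, skipping any that would overrun the budget.

    steps is a list of (name, estimated_seconds, func) tuples, most
    important first. Returns the names of the steps that actually ran.
    """
    deadline = clock() + budget_seconds
    completed = []
    for name, estimated_seconds, func in steps:
        if clock() + estimated_seconds > deadline:
            # Starting this step would likely get it cut off half-way.
            continue
        func()
        completed.append(name)
    return completed
```

Ordering the steps by importance means the checkpoint write is attempted first, and a step such as a large upload is skipped rather than left half-finished.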
How to create a Spot VM
Creating a Spot VM uses the same gcloud compute instances create
command as a standard VM. Two flags make it a Spot VM.
gcloud compute instances create my-batch-worker \
--machine-type=e2-standard-4 \
--image-family=debian-12 \
--image-project=debian-cloud \
--zone=us-central1-a \
--provisioning-model=SPOT \
--instance-termination-action=STOP
Key flags explained:
--provisioning-model=SPOT: marks this as a Spot VM instead of a standard on-demand instance.
--instance-termination-action=STOP: the VM halts on preemption and the boot disk persists. You can restart the VM manually later, or let a managed instance group do it automatically.
--instance-termination-action=DELETE: the VM and its boot disk are deleted on preemption. Use this for ephemeral workers where the disk has no value after interruption; it avoids paying for a stopped disk you will never restart.
For production batch workloads, consider using an instance template and a managed instance group rather than creating individual Spot VMs. A MIG automatically replaces preempted instances so your worker pool stays at the desired size without manual intervention.
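Because a Spot request fails outright when a zone has no spare capacity, a script that creates individual VMs may want to try several zones in turn. The following is a minimal sketch that wraps the gcloud command shown above. The helper names are made up for illustration, and treating any non-zero exit code as "try the next zone" is a simplifying assumption, since gcloud can also fail for reasons unrelated to capacity, such as quota or a name clash.

```python
import subprocess


def build_create_command(name, zone):
    """Build the gcloud command from the example above for one zone."""
    return [
        "gcloud", "compute", "instances", "create", name,
        "--machine-type=e2-standard-4",
        "--image-family=debian-12",
        "--image-project=debian-cloud",
        f"--zone={zone}",
        "--provisioning-model=SPOT",
        "--instance-termination-action=STOP",
    ]


def run_gcloud(command):
    """Run the command and report whether gcloud exited with status 0."""
    return subprocess.run(command, capture_output=True).returncode == 0


def create_in_first_available_zone(name, zones, run=run_gcloud):
    """Try each zone in order; return the zone that worked, or None."""
    for zone in zones:
        if run(build_create_command(name, zone)):
            return zone
    return None
```

Passing `run` as a parameter keeps the zone-selection logic testable without calling gcloud. A real script would inspect the error output before deciding to move on to another zone.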
How to detect interruption
GCP exposes two metadata server endpoints that your application can poll from inside the VM to detect a termination notice as soon as it is issued.
The first endpoint returns TRUE once a preemption notice has been issued.
The second returns the maintenance event type: TERMINATE_ON_HOST_MAINTENANCE when
a stop is pending, or NONE when no event is in progress.
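From application code, the first of these checks might look like the sketch below; the equivalent raw curl requests follow it. The URL and header are the ones used in this article. The function names, the 2-second timeout, and the injectable `fetch` and `sleep` parameters are illustrative choices, and the code only reaches the metadata server when run on a Compute Engine VM.

```python
import time
import urllib.request

PREEMPTED_URL = (
    "http://metadata.google.internal/computeMetadata/v1/instance/preempted"
)


def fetch_preempted():
    """Read the preempted flag from the metadata server (VM only)."""
    request = urllib.request.Request(
        PREEMPTED_URL, headers={"Metadata-Flavor": "Google"}
    )
    with urllib.request.urlopen(request, timeout=2) as response:
        return response.read().decode().strip()


def wait_for_preemption(fetch=fetch_preempted, interval_seconds=5.0,
                        sleep=time.sleep, max_polls=None):
    """Poll until the flag reads TRUE; return False if max_polls runs out."""
    polls = 0
    while max_polls is None or polls < max_polls:
        try:
            if fetch() == "TRUE":
                return True
        except OSError:
            # Network errors and timeouts: try again on the next poll.
            pass
        polls += 1
        sleep(interval_seconds)
    return False
```

Running `wait_for_preemption` in a background thread and setting a shared flag when it returns True lets the main loop stop taking new work and start its checkpoint.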
# Returns TRUE when preemption is imminent
curl -s -H "Metadata-Flavor: Google" \
"http://metadata.google.internal/computeMetadata/v1/instance/preempted"
# Returns the pending maintenance event type, or NONE when no event is in progress
curl -s -H "Metadata-Flavor: Google" \
"http://metadata.google.internal/computeMetadata/v1/instance/maintenance-event"
Poll these endpoints every 5 to 10 seconds from a background thread or sidecar process. When either signals a termination, your application should stop accepting new work, write a checkpoint to Cloud Storage or an external database, flush pending writes, and drain active connections. Thirty seconds is enough for a checkpoint write; it is not enough for a large file upload or a long database transaction.
When to use Spot VMs
Spot VMs are the right choice when your workload is interruptible and restartable. The most common good fits are:
Batch data processing. Jobs that process data in chunks and checkpoint between them can resume from the last saved point after preemption, with minimal wasted work. See Batch Jobs in GCP for how Cloud Batch handles Spot VM interruptions and task-level retries natively.
CI/CD runners. A failed build just reruns. On fast pipelines the wasted time is small, and the savings across a large runner fleet add up quickly.
Video or image rendering. Individual frames or segments are independent. A preempted worker loses at most one segment; the rest complete normally.
Distributed data processing (Dataflow, Spark). Frameworks like Dataflow and Spark handle worker-level failures natively by reassigning incomplete tasks to other workers. Spot VMs integrate directly with Dataproc for this reason.
Machine learning training with checkpointing. Modern ML frameworks checkpoint model weights to persistent storage. Training resumes from the last checkpoint after preemption, with minimal wasted compute.
Fault-tolerant queue workers. Workers that pull tasks from a Pub/Sub subscription or Cloud Tasks queue and acknowledge only on success are naturally restartable. A preempted worker leaves its task unacknowledged, and another worker picks it up.
Test and development environments where an occasional interruption is acceptable and cost matters more than guaranteed availability.
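The queue-worker pattern in the list above depends on acknowledging a task only after it has been handled. The sketch below illustrates that ordering with a toy in-memory queue. `InMemoryQueue`, `Preempted`, and `run_worker` are stand-ins invented for this example and are not the Pub/Sub or Cloud Tasks client APIs.

```python
from collections import deque


class Preempted(Exception):
    """Raised by a demo handler to simulate the VM vanishing mid-task."""


class InMemoryQueue:
    """Toy queue with acknowledgements (not a real client library)."""

    def __init__(self, tasks):
        self._ready = deque(tasks)
        self._in_flight = []

    def pull(self):
        """Hand out the next task, or None when nothing is ready."""
        if not self._ready:
            return None
        task = self._ready.popleft()
        self._in_flight.append(task)
        return task

    def ack(self, task):
        """Mark a task as finished so it is never delivered again."""
        self._in_flight.remove(task)

    def redeliver_unacked(self):
        """Simulate the acknowledgement deadline expiring."""
        self._ready.extend(self._in_flight)
        self._in_flight.clear()


def run_worker(queue, handle, completed):
    """Process tasks until the queue is empty, acknowledging on success."""
    while (task := queue.pull()) is not None:
        handle(task)       # may raise if the worker is preempted
        queue.ack(task)    # reached only when handle() finished
        completed.append(task)
```

If a worker is interrupted inside `handle`, the task stays unacknowledged, so a later call to `redeliver_unacked` makes it available to the next worker. Because a task can therefore run more than once, the handler needs to be idempotent.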
When not to use Spot VMs
Running a stateful database on a Spot VM is the most common and most damaging mistake beginners make. A preemption mid-write can leave data in a corrupt state that is difficult or impossible to recover cleanly. If you need a managed database, use Cloud SQL or Firestore instead.
| Workload type | Use Spot? | Why not |
|---|---|---|
| Stateful databases | No | Preemption during in-flight writes risks data corruption. Use Cloud SQL or a dedicated on-demand VM. |
| Latency-sensitive APIs | No | Interruptions cause request failures and break SLOs. Customer experience degrades unpredictably. |
| Single-instance production services | No | A single preempted Spot VM is a complete outage with no fallback. |
| Workloads with no checkpointing or retry logic | No | All progress is lost on every interruption. The discount does not offset restarting from zero repeatedly. |
| Real-time streaming ingestion | No | Gaps in ingestion pipelines are difficult to backfill accurately and can corrupt downstream analytics. |
How to design for interruption
The discount only pays off if the application handles interruptions cleanly. These patterns make Spot VM workloads genuinely reliable:
Checkpoint progress frequently. Write intermediate results to Cloud Storage or an external database at regular intervals, not just at the end. A six-hour batch job with no checkpointing loses all six hours of work in a single preemption.
Use durable storage for state. Never keep job state only on the VM’s local disk or ephemeral storage. If the VM is deleted on preemption, that data is gone.
Make jobs idempotent. Each unit of work should be safe to run more than once. If a restarted worker reprocesses a record it already handled, the result should be the same as if it had run only once.
Use a MIG to replace preempted VMs automatically. A managed instance group replaces preempted VMs without manual intervention. Pair it with autoscaling to grow and shrink the worker pool based on queue depth or CPU load.
Mix Spot and on-demand in one MIG. Configure a primary Spot instance template with an on-demand fallback. The group maintains capacity even when Spot supply is tight.
Prefer smaller machine types. Smaller machines such as e2-standard-2 and e2-standard-4 have better Spot availability than large or specialised types, because Google has more spare capacity at those sizes. See Machine Types Explained for a guide to the E2, N2, and C2 families and when to use each.
Spread across zones. Use a regional MIG or deploy workers across multiple zones so that zone-level capacity constraints affect a smaller proportion of your fleet.
# Create an instance template for Spot workers
# The startup script resumes from a checkpoint on every boot
gcloud compute instance-templates create spot-worker-template \
--machine-type=e2-standard-4 \
--image-family=debian-12 \
--image-project=debian-cloud \
--provisioning-model=SPOT \
--instance-termination-action=DELETE \
--metadata=startup-script='#!/bin/bash
set -e
# Resume from last checkpoint if one exists
if gsutil -q stat gs://my-results/checkpoint.json; then
gsutil cp gs://my-results/checkpoint.json /tmp/checkpoint.json
fi
python3 /opt/worker/process.py'
# Create a MIG using the Spot template
gcloud compute instance-groups managed create spot-workers \
--template=spot-worker-template \
--size=10 \
--zone=us-central1-a
The startup script checks for an existing checkpoint file on every boot, so a replacement VM picks up where the preempted one left off. See VM Startup Scripts for more detail on writing reliable startup scripts.
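The article does not show what process.py does with the downloaded file. As a minimal sketch, assume the checkpoint is a JSON object with a single `next_index` field (a made-up format) and that the work is an ordered list of items. All function names here are hypothetical, and uploading the checkpoint back to Cloud Storage after each save is left out.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the index of the next unprocessed item (0 if no checkpoint)."""
    try:
        with open(path) as f:
            return json.load(f)["next_index"]
    except FileNotFoundError:
        return 0


def save_checkpoint(path, next_index):
    """Write via a temp file and rename, so a crash never leaves half a file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp_path, path)


def process_all(items, handle, checkpoint_path, checkpoint_every=100):
    """Process items in order, resuming from the last saved checkpoint."""
    start = load_checkpoint(checkpoint_path)
    for index in range(start, len(items)):
        handle(items[index])
        if (index + 1) % checkpoint_every == 0:
            save_checkpoint(checkpoint_path, index + 1)
    save_checkpoint(checkpoint_path, len(items))
```

Items handled after the last save are processed again on restart, which is why the idempotency advice above matters: `handle` must give the same result when it sees the same item twice.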
Common beginner mistakes
Running a database on a Spot VM. Preemption during an in-flight write can corrupt data. Databases need graceful shutdown sequences that a 30-second notice cannot reliably accommodate. Always run databases on standard on-demand VMs or use a managed service like Cloud SQL.
Treating Spot VMs like cheaper standard VMs. Spot VMs are a fundamentally different reliability class, not just a discount. Putting a service on Spot VMs without designing for interruption simply makes it less reliable; the savings only pay off when interruptions are handled properly.
Not testing interruption handling before going live. Checkpoint logic that looks correct in code often has edge cases that only surface when the VM actually stops mid-job. Simulate preemption by stopping the VM manually and verifying that the restarted worker resumes correctly from the saved checkpoint.
Not checkpointing long-running jobs. A job with no checkpoints restarts from zero on every preemption. With even a modest interruption rate, throughput collapses. Write progress to durable storage at regular intervals; every few minutes is a reasonable starting point for most batch workloads.
Assuming Spot capacity is always available. Spot availability varies by machine type, zone, and time of day. When GCP has no spare capacity, the VM creation request fails immediately. Have a fallback: a different zone, a smaller machine type, or on-demand instances via a MIG fallback template.
Using preemptible VMs instead of Spot for new workloads. Some older tutorials still show the --preemptible flag. Preemptible VMs stop after 24 hours regardless of capacity pressure, which makes them unsuitable for jobs that run longer. Use --provisioning-model=SPOT for all new work.
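The cost of skipping checkpoints, mentioned in the list above, can be estimated with a simple model. The sketch assumes preemptions arrive at random at a constant average rate (a Poisson process), that restarts are instant, and that writing a checkpoint takes no time. Under those assumptions, the expected time to finish an uninterrupted stretch of length s at rate r is (e^(r*s) - 1) / r. The function name and the rate used below are illustrative, not measured GCP figures.

```python
import math


def expected_hours(job_hours, preemptions_per_hour, checkpoint_hours=None):
    """Expected wall-clock hours to finish a job under random preemptions.

    Without checkpoints the whole job is one segment that must complete
    uninterrupted; with checkpoints each segment is checkpoint_hours long.
    """
    if preemptions_per_hour == 0:
        return job_hours
    segment = job_hours if checkpoint_hours is None else checkpoint_hours
    segments = job_hours / segment
    per_segment = math.expm1(preemptions_per_hour * segment) / preemptions_per_hour
    return segments * per_segment
```

For a six-hour job with an assumed average of one preemption every six hours, `expected_hours(6, 1/6)` comes to roughly 10.3 hours, while `expected_hours(6, 1/6, checkpoint_hours=1/12)` (a checkpoint every five minutes) is about 6.04 hours.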
Spot VMs vs standard VMs
| Aspect | Spot VM | Standard (on-demand) VM |
|---|---|---|
| Cost | 60–91% cheaper | Full price |
| Availability guarantee | None; GCP can stop at any time | Runs until you stop it |
| Maximum runtime | Unlimited, subject to interruption | Unlimited |
| Termination notice | 30 seconds | None; you control shutdown |
| Best workload fit | Fault-tolerant, restartable jobs | Any workload |
| Operational complexity | Higher; requires interruption handling | Low |
The choice is not always one or the other. A common production pattern is to run stateful services and APIs on standard VMs while using Spot VMs for background processing, nightly batch jobs, and CI pipelines. See the Compute Engine cost optimisation guide for other techniques (committed use discounts, rightsizing, and scheduling) that complement Spot VMs well.
Frequently asked questions
What is the difference between Spot and preemptible VMs in GCP?
Both types use spare capacity and can be stopped by GCP at any time with 30 seconds' notice. The key difference is that preemptible VMs always stop after 24 hours regardless of capacity pressure, while Spot VMs have no runtime cap. Use Spot VMs for all new workloads. Preemptible VMs are being phased out.
How much cheaper are Spot VMs?
Typically 60 to 91 percent cheaper than equivalent on-demand VMs. The exact discount depends on machine type and region. Unlike AWS Spot, GCP Spot VM pricing is set by Google per machine type and region rather than by auction, so there is no bidding, although the published price can be revised over time. See Spot VMs for cost savings for figures and worked examples.
How much notice do you get before a Spot VM stops?
GCP gives 30 seconds' notice. Poll the metadata server from inside the VM to detect the notice before it becomes a termination. That window is enough to checkpoint state or flush pending writes to Cloud Storage, but not enough to complete a large upload or a complex database transaction. Design for the constraint from the start.
Can Spot VMs be used in production?
Yes, for the right kind of production workload. A fleet of Spot batch workers inside a managed instance group that replaces preempted VMs automatically is a common production pattern. What you should not do is run a stateful service, database, or single-instance API on a Spot VM with no fallback. See Designing Highly Available Systems for patterns that combine reliability and cost efficiency.
What happens to the disk or data when a Spot VM is interrupted?
It depends on the termination action you configured. With STOP,
the VM halts and the boot disk persists; you can restart it later. With
DELETE, both the VM and boot disk are deleted. Data stored in
Cloud Storage, Cloud SQL, or another external service is unaffected by
either termination action.
Summary
- Spot VMs cost 60–91% less than on-demand VMs but can be stopped at any time with 30 seconds' notice
- Spot VMs replace the older preemptible type; the key improvement is no 24-hour runtime cap
- Poll the metadata server from inside the VM to detect termination and checkpoint state in time
- Good fits: batch processing, CI runners, rendering, ML training, fault-tolerant queue workers
- Bad fits: databases, stateful services, latency-sensitive APIs, workloads with no retry logic
- Managed instance groups make Spot VM fleets self-healing; preempted VMs are replaced automatically