Managed Instance Groups in GCP: Autohealing, Rolling Updates, and Scaling

A managed instance group runs a fleet of identical Compute Engine VMs as a single managed unit. Instead of creating and maintaining individual VMs, you define the configuration once in an instance template and let the group handle the rest: replacing unhealthy VMs, scaling with traffic, and deploying updates gradually without downtime.

Simple explanation

If you have never used a managed instance group before, here is the core idea:

  • A MIG is a group of identical VMs, all created from one instance template.
  • You tell the group how many VMs you want. GCP maintains that count.
  • If a VM becomes unhealthy or crashes, the group deletes it and creates a new one automatically.
  • When you want to deploy an update, you create a new template version and the group replaces VMs gradually. Old ones go down, new ones come up.
  • MIGs work best for stateless applications: web servers, API backends, background workers. Anything where it does not matter which specific VM handles a given request.

Analogy

Think of a MIG like a vending machine stocked with identical items. You define the item spec once (the instance template). The machine (the MIG) keeps the right number in stock. If one item is found faulty, it is discarded and replaced automatically. You never interact with individual items — you just set the desired quantity and the machine handles the rest.

Why managed instance groups matter

Running VMs individually works fine for one or two machines. Once you have three or more doing the same job, managing them by hand creates real problems:

  • Resilience: a single VM is a single point of failure. A MIG survives individual VM failures without any manual intervention.
  • Repeatability: every VM starts from the same template, so there is no configuration drift between instances over time.
  • Scaling: traffic spikes are handled by adding VMs, not by manually provisioning them under pressure.
  • Safer deployments: rolling updates replace VMs one batch at a time, so a bad deployment does not take down your entire service at once.
  • Less operational overhead: the group self-heals. You do not need to monitor for crashed VMs and recreate them by hand.

How managed instance groups work

The flow from initial setup to a running, self-healing fleet:

  1. Create an instance template. The template defines machine type, OS image, disk size, network tags, service account, and startup script. Every VM in the group is created from this spec. See Instance Templates for the full guide.
  2. Create the managed instance group. You specify the template, the desired number of VMs, and whether the group is zonal or regional.
  3. GCP provisions the VMs. The group creates the requested number of VMs. Each one starts up, runs the startup script, and becomes available.
  4. Attach a health check. The group polls each VM on a port and path you define. A VM that stops responding is marked unhealthy and replaced.
  5. Optionally attach autoscaling. The group can grow and shrink automatically based on CPU utilisation, load balancer traffic metrics, or a schedule. See Autoscaling Instance Groups.
  6. Deploy updates via a new template version. When you need to change the OS image, machine type, or startup script, create a new template and trigger a rolling update. The group replaces VMs in batches.
  7. The group maintains desired state continuously. If a VM is deleted, fails a health check, or is preempted, the group creates a replacement from the current template.

Analogy

The MIG behaves like a self-repairing assembly line. The blueprint (instance template) never changes mid-run. If a station breaks down (VM fails), the line automatically brings in a replacement built to the same blueprint. When you want to update the blueprint, you swap it in gradually — one station at a time — rather than stopping the whole line at once.

Zonal vs regional managed instance groups

The most important decision when creating a MIG is whether it is zonal or regional. This determines your availability posture directly.

Aspect | Zonal MIG | Regional MIG
Scope | All VMs in one zone | VMs spread across 2–3 zones in a region
Zone failure impact | All VMs go down | VMs in other zones continue serving traffic
Complexity | Simpler to reason about | Slightly more setup (load balancer recommended)
Best use case | Dev/test, non-critical workloads | Production services, SLA-sensitive workloads
Recommendation | Fine for experimentation | Use this for production

Tip

If your service needs to stay up during a zone outage, always use a regional MIG. The extra setup is minimal and the protection is automatic. For deeper guidance on designing around zone failures, see Designing Highly Available Systems.

Creating a managed instance group

You need an instance template before creating a MIG. The commands below assume you already have one named web-server-template. For a walkthrough of creating your first VM and template from scratch, see Creating Your First VM.

# Create a zonal MIG with 3 VMs
gcloud compute instance-groups managed create web-mig \
  --template=web-server-template \
  --size=3 \
  --zone=us-central1-a

# Create a regional MIG spread across multiple zones in a region
gcloud compute instance-groups managed create web-mig-regional \
  --template=web-server-template \
  --size=6 \
  --region=us-central1

# Check VM status in the group
gcloud compute instance-groups managed list-instances web-mig \
  --zone=us-central1-a

For a regional MIG, GCP distributes VMs evenly across the group's zones automatically. A regional MIG uses three zones by default, so a group of 6 in us-central1 will typically place 2 VMs in each of its three zones.
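
The even spread can be illustrated with simple integer arithmetic. This is only a sketch of the expected outcome, not how the service is implemented. The function name spread_across_zones is hypothetical, and which zone receives a leftover VM is an assumption made for the example.

```shell
# Hypothetical helper: print how many VMs land in each zone when
# SIZE VMs are spread as evenly as possible over ZONES zones.
spread_across_zones() {
  size=$1
  zones=$2
  base=$(( size / zones ))     # every zone gets at least this many
  extra=$(( size % zones ))    # leftover VMs, at most one per zone
  i=0
  out=""
  while [ "$i" -lt "$zones" ]; do
    if [ "$i" -lt "$extra" ]; then
      out="$out $(( base + 1 ))"
    else
      out="$out $base"
    fi
    i=$(( i + 1 ))
  done
  echo "${out# }"              # drop the leading space
}

spread_across_zones 6 3   # prints: 2 2 2
spread_across_zones 7 3   # prints: 3 2 2
```

When the size does not divide evenly, per-zone counts differ by at most one, which is why sizing a regional group as a multiple of its zone count keeps capacity symmetrical.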

MIG pattern | Best fit | Related guide
Zonal MIG | Dev/test or non-critical single-zone workloads. | Creating Your First VM
Regional MIG | Production services requiring zone-failure resilience. | Designing Highly Available Systems
Autoscaled MIG | Traffic that varies by hour or day. | Autoscaling Instance Groups
Spot VM MIG | Fault-tolerant workloads using discounted capacity. | Preemptible and Spot VMs

Autohealing with health checks

Without a health check, the MIG only replaces VMs that crash at the hypervisor level. That misses a much more common failure mode: the VM is running but the application inside it is broken. A deadlocked process, an OOM-killed app, or a service that stalled during startup leaves the VM itself in a running state, so the group sees nothing wrong while the application silently fails to serve requests.

Attaching an HTTP health check closes this gap. The MIG polls a specific port and path on each VM. If a VM stops responding correctly, the group deletes it and creates a replacement.

# Create an HTTP health check
gcloud compute health-checks create http web-health-check \
  --port=80 \
  --request-path=/health \
  --check-interval=10s \
  --timeout=5s \
  --healthy-threshold=2 \
  --unhealthy-threshold=3

# Attach the health check to the MIG with an initial delay
gcloud compute instance-groups managed update web-mig \
  --zone=us-central1-a \
  --health-check=web-health-check \
  --initial-delay=60s

The --initial-delay=60s flag gives new VMs time to finish booting and starting the application before health checks begin evaluating them. Set it to comfortably exceed your application’s actual startup time.

Warning

If the initial delay is shorter than your application’s startup time, the health check fires before the app is ready and the VM is marked unhealthy. The MIG deletes it and creates a replacement, which is deleted in turn, and the group gets stuck in a replacement loop. Measure your actual startup time and add a buffer of at least 20–30 seconds on top of it.
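
The timing can be worked through with a little arithmetic. As a rough sketch, a VM that stops responding is marked unhealthy after about the check interval multiplied by the unhealthy threshold, and a reasonable initial delay is the measured startup time plus a buffer. The 45-second startup time is an assumed example value, and the variable names are hypothetical.

```shell
# Values from the health check created earlier in this guide.
check_interval=10        # seconds between probes (--check-interval)
unhealthy_threshold=3    # consecutive failures required (--unhealthy-threshold)

# Assumed example values: measure your own application instead.
startup_seconds=45       # hypothetical measured startup time
buffer_seconds=30        # upper end of the 20 to 30 second buffer

# Approximate time for a failing VM to be marked unhealthy.
detection_seconds=$(( check_interval * unhealthy_threshold ))

# Initial delay that comfortably exceeds the startup time.
initial_delay=$(( startup_seconds + buffer_seconds ))

echo "approx. detection time: ${detection_seconds}s"   # prints: approx. detection time: 30s
echo "suggested --initial-delay=${initial_delay}s"     # prints: suggested --initial-delay=75s
```

Under these assumptions, the 60-second delay used above would be too short, which is the kind of mismatch that produces a replacement loop.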

Note

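MIG health checks and load balancer health checks serve similar roles but are configured separately. It is common to reuse the same check definition for both, although a separate and more tolerant check for autohealing is often preferable, so that a VM that is merely slow is taken out of rotation rather than deleted. See HTTP Load Balancer Setup for how MIGs integrate with GCP load balancers to distribute traffic across a healthy VM fleet.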

Rolling updates

To deploy a new application version, create a new instance template with the updated configuration — new OS image, updated startup script, different machine type, whatever changed. Then trigger a rolling update to point the MIG at the new template.
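
Because a template cannot be edited after creation, the new version needs a new name. A minimal sketch of creating it follows. The machine type, image family, network tag, and the startup.sh file are example assumptions, not values taken from this guide.

```shell
# Create a second template version. All values below are example
# assumptions; startup.sh is a hypothetical local startup script.
gcloud compute instance-templates create web-server-template-v2 \
  --machine-type=e2-medium \
  --image-family=debian-12 \
  --image-project=debian-cloud \
  --tags=http-server \
  --metadata-from-file=startup-script=startup.sh
```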

The group replaces VMs in batches, not all at once. Two flags control the pace:

  • --max-unavailable: how many VMs can be down simultaneously during the rollout. Lower is safer but slower.
  • --max-surge: how many extra VMs above target size can be created temporarily during the rollout. Higher values speed up the rollout at additional cost.

# Start a rolling update to a new template
gcloud compute instance-groups managed rolling-action start-update web-mig \
  --zone=us-central1-a \
  --version=template=web-server-template-v2 \
  --max-unavailable=1 \
  --max-surge=1

# Check the update status (one-off snapshot; rerun to see progress)
gcloud compute instance-groups managed describe web-mig \
  --zone=us-central1-a

# Wait until the update is complete
gcloud compute instance-groups managed wait-until web-mig \
  --version-target-reached \
  --zone=us-central1-a

Tip

For a zero-downtime rollout, set --max-unavailable=0 and keep --max-surge at 1 or higher, since the group needs room to create new VMs first. The MIG will create new VMs and wait for them to pass the health check before removing any old ones. This is slower and temporarily increases your VM count, but no capacity is lost at any point during the update.
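
The trade-off between the two flags can be sketched as a capacity calculation. This is a simplified model, assuming fixed-number values for both flags. The function name rollout_bounds is hypothetical.

```shell
# Hypothetical helper: print the capacity bounds of a rolling update.
#   $1 = target size, $2 = max-surge, $3 = max-unavailable
rollout_bounds() {
  size=$1
  max_surge=$2
  max_unavailable=$3
  # Fewest VMs guaranteed to be serving at any point in the rollout.
  min_available=$(( size - max_unavailable ))
  # Most VMs that can exist (and be billed) at the same time.
  max_total=$(( size + max_surge ))
  echo "min_available=${min_available} max_total=${max_total}"
}

rollout_bounds 3 1 1   # prints: min_available=2 max_total=4
rollout_bounds 3 1 0   # prints: min_available=3 max_total=4
```

With a target size of 3, setting max-unavailable to 0 keeps all three VMs serving throughout, at the price of briefly running a fourth.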

Manual resizing

When autoscaling is not attached, you can resize the group manually. For workloads with variable traffic, autoscaling is usually a better approach than adjusting size by hand.

# Scale up to 5 VMs
gcloud compute instance-groups managed resize web-mig \
  --size=5 \
  --zone=us-central1-a

# Scale down to 2 VMs
gcloud compute instance-groups managed resize web-mig \
  --size=2 \
  --zone=us-central1-a
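
For comparison, attaching an autoscaler to the same group might look like the sketch below. The replica bounds, CPU target, and cool-down period are example assumptions and should be tuned to the workload.

```shell
# Attach an autoscaler that targets 60% average CPU utilisation.
# The replica bounds, target, and cool-down values are example assumptions.
gcloud compute instance-groups managed set-autoscaling web-mig \
  --zone=us-central1-a \
  --min-num-replicas=2 \
  --max-num-replicas=10 \
  --target-cpu-utilization=0.60 \
  --cool-down-period=90
```

Once an autoscaler is attached, it controls the group size, so manual resize commands are no longer the right tool for that group.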

When to use managed instance groups

MIGs are the right choice when you need a repeatable, self-healing fleet of VMs. Common use cases:

  • Stateless web applications on Compute Engine: API servers, web frontends, reverse proxies that can run on any VM in the group.
  • Backend services behind a load balancer: MIGs register directly with GCP external and internal load balancers, making horizontal scaling straightforward.
  • Background workers and queue processors: fleets of VMs pulling tasks from Pub/Sub or Cloud Tasks, where individual VM failures are normal and expected.
  • Cost-optimised fleets using Spot VMs: MIGs handle Spot VM preemptions automatically, making them practical for fault-tolerant workloads at 60–91% lower cost.
  • Workloads that need VM-level control: specific OS configurations, custom kernel modules, or software that cannot run in a container.
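
As an illustration of the Spot VM use case, the sketch below creates a template that requests Spot capacity and a group built from it. The template name, group name, machine type, and size are hypothetical example values.

```shell
# Template that requests Spot capacity (name and machine type are assumptions).
gcloud compute instance-templates create web-server-template-spot \
  --machine-type=e2-medium \
  --provisioning-model=SPOT

# Zonal MIG built from the Spot template (hypothetical group name).
gcloud compute instance-groups managed create worker-mig-spot \
  --template=web-server-template-spot \
  --size=3 \
  --zone=us-central1-a
```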

When not to use them

Warning

Standard MIGs treat every VM as disposable. Any data written to VM local storage is permanently lost when a VM is replaced. If your application stores state on disk and you have not explicitly designed around this, a MIG will cause data loss.

  • One-off or long-lived “pet” VMs: if you are running a single VM for a specific purpose with no intention to scale it, a standalone VM is simpler and more appropriate.
  • Stateful workloads without careful design: use Cloud SQL, Filestore, or Cloud Storage for state, or look at stateful MIG policies for specific use cases.
  • Container-native applications: if you are already building with containers, Cloud Run or GKE are better fits and require less infrastructure management. See Cloud Run vs Compute Engine for a direct comparison.
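
For the stateful MIG policies mentioned above, a minimal sketch of marking a disk as stateful is shown below. The device name data-disk is hypothetical and is assumed to match a disk defined in the instance template.

```shell
# Mark a disk as stateful so it is kept when a VM is recreated.
# "data-disk" is a hypothetical device name that must match a disk
# defined in the instance template.
gcloud compute instance-groups managed update web-mig \
  --zone=us-central1-a \
  --stateful-disk=device-name=data-disk,auto-delete=never
```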

Managed instance groups vs single VMs

Aspect | Single VM | Managed Instance Group
Fault tolerance | One failure takes the service down | Failed VMs replaced automatically
Scaling | Manual resize or add a second VM | Horizontal auto or manual scaling
Deployment | SSH in, run commands, or redeploy manually | Rolling update via new template version
Configuration consistency | Can drift over time from manual changes | Enforced by immutable instance template
Best for | One-off tasks, dev/test, single-purpose VMs | Production services, repeatable VM fleets

GCP also has unmanaged instance groups, which let you group arbitrary existing VMs together for load balancing purposes. They have no autohealing, no rolling updates, and no autoscaling. They exist mainly for legacy use cases where VMs cannot all be identical. For new infrastructure, use managed instance groups.

Common beginner mistakes

  1. Not attaching a health check. Without a health check, autohealing only triggers on hypervisor-level crashes. A deadlocked application that is still running at the OS level will not be replaced. Always attach an HTTP health check for application-level autohealing.

  2. Using a zonal MIG for production workloads. A zone outage takes down your entire service. Use a regional MIG to spread VMs across multiple zones. The protection is automatic once the group is regional.

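  3. Manually modifying VMs inside a MIG. The MIG does not track changes made inside a running VM, but every replacement (after autohealing, a rolling update, or a resize) starts again from the instance template, so manual changes are silently lost. Treat MIG VMs as immutable. All configuration changes go through a new template version and a rolling update.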

    Danger

    SSHing into a MIG VM and changing configuration directly is one of the most common ways teams get into trouble. The MIG will overwrite your changes on the next replacement. Worse, it can cause inconsistency between VMs in the same group until they are all recycled.

  4. Setting the initial delay too short. If your application takes 60 seconds to fully start and the initial delay is 10 seconds, the health check fires before the app is ready, the VM is marked unhealthy, and the MIG replaces it in a loop. Set the initial delay to comfortably exceed your application’s actual startup time.

  5. Confusing the MIG with the load balancer or autoscaler. The MIG, load balancer, health check, and autoscaler are separate GCP resources configured independently. The MIG manages the VM fleet. The load balancer distributes traffic. The autoscaler adjusts group size. Health checks are used by both the MIG (for autohealing) and the load balancer (for routing decisions).

Frequently asked questions

What is a managed instance group in GCP?

A managed instance group (MIG) is a set of identical Compute Engine VMs created from a single instance template and managed as a unit. The group maintains a desired VM count, automatically replaces unhealthy VMs, supports rolling updates to deploy new versions without downtime, and integrates with load balancers and autoscaling. MIGs are the standard pattern for running stateless services at scale on Compute Engine.

What is the difference between a zonal and regional managed instance group?

A zonal MIG runs all VMs in a single zone. A regional MIG spreads VMs across multiple zones in a region. If one zone goes down, VMs in other zones continue serving traffic. Use regional MIGs for production workloads. Use zonal MIGs for dev/test or situations where zone redundancy is not required.

Do managed instance groups require a load balancer?

No. A MIG can run without a load balancer. You can use MIGs for background workers, batch processing, or any VM fleet that does not receive HTTP traffic. A load balancer is optional, but MIGs integrate directly with GCP load balancers when you need to distribute traffic across VMs.

How does autohealing work in a MIG?

Application-level autohealing requires an attached health check, typically HTTP or TCP. If a VM fails the check, for example by no longer responding on the expected port, the MIG deletes it and creates a replacement from the instance template. Without a health check, the MIG only replaces VMs that crash at the hypervisor level. A deadlocked application that is still running at the OS level will not be replaced unless a health check probes the application itself.

Are managed instance groups suitable for stateful workloads?

Standard MIGs are designed for stateless workloads where any VM can be replaced without data loss. GCP also offers stateful MIGs with stateful policies that preserve specific disks and metadata across replacements. For most stateful applications, managed database services like Cloud SQL or Spanner are a better fit than running stateful services inside a MIG.

Last verified: 22 March 2026. Cloud services change frequently. Verify details against official documentation before making infrastructure decisions.