Managed Instance Groups in GCP: Autohealing, Rolling Updates, and Scaling
A managed instance group runs a fleet of identical Compute Engine VMs as a single managed unit. Instead of creating and maintaining individual VMs, you define the configuration once in an instance template and let the group handle the rest: replacing unhealthy VMs, scaling with traffic, and deploying updates gradually without downtime.
Simple explanation
If you have never used a managed instance group before, here is the core idea:
- A MIG is a group of identical VMs, all created from one instance template.
- You tell the group how many VMs you want, and GCP maintains that count.
- If a VM becomes unhealthy or crashes, the group deletes it and creates a new one automatically.
- When you want to deploy an update, you create a new template version and the group replaces VMs gradually. Old ones go down, new ones come up.
- MIGs work best for stateless applications: web servers, API backends, background workers. Anything where it does not matter which specific VM handles a given request.
Think of a MIG like a vending machine stocked with identical items. You define the item spec once (the instance template). The machine (the MIG) keeps the right number in stock. If one item is found faulty, it is discarded and replaced automatically. You never interact with individual items — you just set the desired quantity and the machine handles the rest.
Why managed instance groups matter
Running VMs individually works fine for one or two machines. Once you have three or more doing the same job, managing them by hand creates real problems:
- Resilience: a single VM is a single point of failure. A MIG survives individual VM failures without any manual intervention.
- Repeatability: every VM starts from the same template, so there is no configuration drift between instances over time.
- Scaling: traffic spikes are handled by adding VMs, not by manually provisioning them under pressure.
- Safer deployments: rolling updates replace VMs one batch at a time, so a bad deployment does not take down your entire service at once.
- Less operational overhead: the group self-heals. You do not need to monitor for crashed VMs and recreate them by hand.
How managed instance groups work
The flow from initial setup to a running, self-healing fleet:
- Create an instance template. The template defines machine type, OS image, disk size, network tags, service account, and startup script. Every VM in the group is created from this spec. See Instance Templates for the full guide.
- Create the managed instance group. You specify the template, the desired number of VMs, and whether the group is zonal or regional.
- GCP provisions the VMs. The group creates the requested number of VMs. Each one starts up, runs the startup script, and becomes available.
- Attach a health check. The group polls each VM on a port and path you define. A VM that stops responding is marked unhealthy and replaced.
- Optionally attach autoscaling. The group can grow and shrink automatically based on CPU utilisation, load balancer traffic metrics, or a schedule. See Autoscaling Instance Groups.
- Deploy updates via a new template version. When you need to change the OS image, machine type, or startup script, create a new template and trigger a rolling update. The group replaces VMs in batches.
- The group maintains desired state continuously. If a VM is deleted, fails a health check, or is preempted, the group creates a replacement from the current template.
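The first step above can be sketched with gcloud. This is a minimal illustration, not a recommended production configuration. Only the template name web-server-template comes from this guide. The machine type, image family, disk size, the http-server tag, and the local startup.sh file are assumptions.

```shell
# Step 1 (sketch): create the instance template every VM in the group is built from.
# Assumed values: e2-medium, Debian 12, a 20 GB boot disk, a local startup.sh file.
gcloud compute instance-templates create web-server-template \
  --machine-type=e2-medium \
  --image-family=debian-12 \
  --image-project=debian-cloud \
  --boot-disk-size=20GB \
  --tags=http-server \
  --metadata-from-file=startup-script=startup.sh
```

Instance templates cannot be edited after creation, which is why later changes go through a new template rather than a modification of this one.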
The MIG behaves like a self-repairing assembly line. The blueprint (instance template) never changes mid-run. If a station breaks down (VM fails), the line automatically brings in a replacement built to the same blueprint. When you want to update the blueprint, you swap it in gradually — one station at a time — rather than stopping the whole line at once.
Zonal vs regional managed instance groups
The most important decision when creating a MIG is whether it is zonal or regional. This determines your availability posture directly.
| | Zonal MIG | Regional MIG |
|---|---|---|
| Scope | All VMs in one zone | VMs spread across 2–3 zones in a region |
| Zone failure impact | All VMs go down | VMs in other zones continue serving traffic |
| Complexity | Simpler to reason about | Slightly more setup (load balancer recommended) |
| Best use case | Dev/test, non-critical workloads | Production services, SLA-sensitive workloads |
| Recommendation | Fine for experimentation | Use this for production |
If your service needs to stay up during a zone outage, always use a regional MIG. The extra setup is minimal and the protection is automatic. For deeper guidance on designing around zone failures, see Designing Highly Available Systems.
Creating a managed instance group
You need an instance template before creating a MIG. The commands below
assume you already have one named web-server-template. For a
walkthrough of creating your first VM and template from scratch, see
Creating Your First VM.
# Create a zonal MIG with 3 VMs
gcloud compute instance-groups managed create web-mig \
--template=web-server-template \
--size=3 \
--zone=us-central1-a
# Create a regional MIG spread across multiple zones in a region
gcloud compute instance-groups managed create web-mig-regional \
--template=web-server-template \
--size=6 \
--region=us-central1
# Check VM status in the group
gcloud compute instance-groups managed list-instances web-mig \
--zone=us-central1-a
For a regional MIG, GCP distributes VMs evenly across zones automatically.
A group of 6 in us-central1 will typically place 2 VMs in each
of the three zones the group uses.
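To confirm how a regional group actually placed its VMs, one option is to count instances per zone. The sketch below assumes that list-instances with --format="value(instance)" prints one instance URL per line, and that each URL contains /zones/ZONE/instances/NAME. The helper name count_by_zone is hypothetical.

```shell
# count_by_zone: read instance URLs on stdin, print "<zone> <count>" for each zone.
count_by_zone() {
  awk -F/ '{
    # Walk the URL path segments; the segment after "zones" is the zone name.
    for (i = 1; i < NF; i++)
      if ($i == "zones") { count[$(i + 1)]++; break }
  }
  END { for (z in count) print z, count[z] }' | sort
}

# Example with gcloud (not run here):
#   gcloud compute instance-groups managed list-instances web-mig-regional \
#     --region=us-central1 --format="value(instance)" | count_by_zone
```

An uneven result is not necessarily a problem, since the group may still be creating or replacing VMs when you check.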
| MIG pattern | Best fit | Related guide |
|---|---|---|
| Zonal MIG | Dev/test or non-critical single-zone workloads. | Creating Your First VM |
| Regional MIG | Production services requiring zone-failure resilience. | Designing Highly Available Systems |
| Autoscaled MIG | Traffic that varies by hour or day. | Autoscaling Instance Groups |
| Spot VM MIG | Fault-tolerant workloads using discounted capacity. | Preemptible and Spot VMs |
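For the autoscaled pattern in the table above, a minimal sketch of attaching CPU-based autoscaling to the zonal group created earlier might look like this. The replica bounds, the 60% CPU target, and the 90-second cool-down are assumed values for illustration, not tuned recommendations.

```shell
# Attach CPU-based autoscaling to web-mig (assumed bounds and target).
gcloud compute instance-groups managed set-autoscaling web-mig \
  --zone=us-central1-a \
  --min-num-replicas=2 \
  --max-num-replicas=10 \
  --target-cpu-utilization=0.6 \
  --cool-down-period=90
```

Once autoscaling is attached, the autoscaler owns the group size, so manual resizing no longer applies.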
Autohealing with health checks
Without a health check, the MIG only replaces VMs that crash at the hypervisor level. That misses a much more common failure mode: the VM is running but the application inside it is broken. A deadlocked process, an OOM-killed app, or a service that stalled during startup leaves the VM itself running, so the group sees nothing wrong while requests silently fail.
Attaching an HTTP health check closes this gap. The MIG polls a specific port and path on each VM. If a VM stops responding correctly, the group deletes it and creates a replacement.
# Create an HTTP health check
gcloud compute health-checks create http web-health-check \
--port=80 \
--request-path=/health \
--check-interval=10s \
--timeout=5s \
--healthy-threshold=2 \
--unhealthy-threshold=3
# Attach the health check to the MIG with an initial delay
gcloud compute instance-groups managed update web-mig \
--zone=us-central1-a \
--health-check=web-health-check \
--initial-delay=60s
The --initial-delay=60s flag gives new VMs time to finish booting
and start the application before health checks begin evaluating them.
Set it to comfortably exceed your application’s actual startup time.
If the initial delay is shorter than your application’s startup time, the health check fails before the app is ready and the VM is marked unhealthy. The MIG deletes it and creates a replacement, which fails the same way and is deleted in turn: a replacement loop. Measure your actual startup time and add a buffer of at least 20–30 seconds on top of it.
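As a worked example of that arithmetic, the sketch below combines a measured startup time with the health check settings shown above. The 45-second startup time is a hypothetical measurement. The detection estimate (check interval multiplied by unhealthy threshold) is an approximation that ignores the probe timeout.

```shell
# Rough timing arithmetic for autohealing settings (all values in seconds).
startup_seconds=45      # hypothetical: measured time from boot to first healthy response
buffer_seconds=30       # upper end of the 20-30 second buffer suggested above
check_interval=10       # matches --check-interval=10s
unhealthy_threshold=3   # matches --unhealthy-threshold=3

# Initial delay = measured startup time plus a safety buffer.
initial_delay=$((startup_seconds + buffer_seconds))
# Approximate time for a failing VM to be marked unhealthy once checks count.
detection_time=$((check_interval * unhealthy_threshold))

echo "initial delay: ${initial_delay}s"
echo "approx. time to mark a VM unhealthy: ${detection_time}s"
```

With these assumed numbers the script suggests an initial delay of 75 seconds, so the 60 seconds used earlier would leave only a 15-second margin for this hypothetical application.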
MIG health checks and load balancer health checks are attached separately, but they serve similar roles. It is common to reuse the same check definition for both, although a more lenient check for autohealing avoids recreating VMs over brief blips. See HTTP Load Balancer Setup for how MIGs integrate with GCP load balancers to distribute traffic across a healthy VM fleet.
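Health check probes only reach the VMs if a firewall rule allows them in. A minimal sketch of such a rule follows. It assumes the VMs sit on the default network and serve the check on port 80, and the rule name allow-health-checks is hypothetical. The two source ranges are the ones Google Cloud documents for health check probes.

```shell
# Allow Google Cloud health check probes to reach port 80 on the group's VMs.
# Assumed: VMs are on the "default" network; the rule name is hypothetical.
gcloud compute firewall-rules create allow-health-checks \
  --network=default \
  --direction=INGRESS \
  --allow=tcp:80 \
  --source-ranges=130.211.0.0/22,35.191.0.0/16
```

Without a rule like this, every probe can time out and a healthy fleet may be reported as unhealthy.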
Rolling updates
To deploy a new application version, create a new instance template with the updated configuration — new OS image, updated startup script, different machine type, whatever changed. Then trigger a rolling update to point the MIG at the new template.
The group replaces VMs in batches, not all at once. Two flags control the pace:
- max-unavailable: how many VMs can be down simultaneously during the rollout. Lower is safer but slower.
- max-surge: how many extra VMs above target size can be created temporarily during the rollout. Higher values speed up the rollout at additional cost.
# Start a rolling update to a new template
gcloud compute instance-groups managed rolling-action start-update web-mig \
--zone=us-central1-a \
--version=template=web-server-template-v2 \
--max-unavailable=1 \
--max-surge=1
# Watch the update status
gcloud compute instance-groups managed describe web-mig \
--zone=us-central1-a
# Wait until the update is complete
gcloud compute instance-groups managed wait-until web-mig \
--version-target-reached \
--zone=us-central1-a
For a zero-downtime rollout, set --max-unavailable=0 and keep --max-surge
at 1 or more, since both cannot be zero. The MIG
will create new VMs and wait for them to pass the health check before removing
any old ones. This is slower and temporarily increases your VM count, but no
capacity is lost at any point during the update.
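The trade-off between the two flags can be made concrete with a little arithmetic. The helper below is a simplified model, not something gcloud provides. It assumes both flags are fixed numbers rather than percentages, and the function name rollout_bounds is hypothetical.

```shell
# rollout_bounds SIZE MAX_SURGE MAX_UNAVAILABLE
# Prints the peak VM count and the minimum number of available VMs during a rollout.
rollout_bounds() {
  local size=$1 max_surge=$2 max_unavailable=$3
  echo "peak=$((size + max_surge)) min_available=$((size - max_unavailable))"
}

rollout_bounds 3 1 1   # settings from the rolling update above: peak=4 min_available=2
rollout_bounds 3 1 0   # zero-downtime variant: peak=4 min_available=3
```

For a group of 3, setting max-unavailable to 0 keeps all 3 VMs serving throughout, at the cost of briefly paying for a fourth.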
Manual resizing
When autoscaling is not attached, you can resize the group manually. For workloads with variable traffic, autoscaling is usually a better approach than adjusting size by hand.
# Scale up to 5 VMs
gcloud compute instance-groups managed resize web-mig \
--size=5 \
--zone=us-central1-a
# Scale down to 2 VMs
gcloud compute instance-groups managed resize web-mig \
--size=2 \
--zone=us-central1-a
When to use managed instance groups
MIGs are the right choice when you need a repeatable, self-healing fleet of VMs. Common use cases:
- Stateless web applications on Compute Engine: API servers, web frontends, reverse proxies that can run on any VM in the group.
- Backend services behind a load balancer: MIGs register directly with GCP external and internal load balancers, making horizontal scaling straightforward.
- Background workers and queue processors: fleets of VMs pulling tasks from Pub/Sub or Cloud Tasks, where individual VM failures are normal and expected.
- Cost-optimised fleets using Spot VMs: MIGs handle Spot VM preemptions automatically, making them practical for fault-tolerant workloads at 60–91% lower cost.
- Workloads that need VM-level control: specific OS configurations, custom kernel modules, or software that cannot run in a container.
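For the Spot VM use case above, the change is made in the instance template rather than in the group itself. A minimal sketch, assuming the same illustrative machine type and image as a regular template. The name web-server-template-spot is hypothetical.

```shell
# Sketch: an instance template that requests Spot capacity.
# The template name, machine type, and image are assumptions for illustration.
gcloud compute instance-templates create web-server-template-spot \
  --machine-type=e2-medium \
  --image-family=debian-12 \
  --image-project=debian-cloud \
  --provisioning-model=SPOT
```

A MIG created from this template is managed the same way as any other. The difference is that its VMs can be preempted at any time, so the workload has to tolerate sudden replacement.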
When not to use them
Standard MIGs treat every VM as disposable. Any data written to VM local storage is permanently lost when a VM is replaced. If your application stores state on disk and you have not explicitly designed around this, a MIG will cause data loss.
- One-off or long-lived “pet” VMs: if you are running a single VM for a specific purpose with no intention to scale it, a standalone VM is simpler and more appropriate.
- Stateful workloads without careful design: use Cloud SQL, Filestore, or Cloud Storage for state, or look at stateful MIG policies for specific use cases.
- Container-native applications: if you are already building with containers, Cloud Run or GKE are better fits and require less infrastructure management. See Cloud Run vs Compute Engine for a direct comparison.
Managed instance groups vs single VMs
| | Single VM | Managed Instance Group |
|---|---|---|
| Fault tolerance | One failure takes the service down | Failed VMs replaced automatically |
| Scaling | Manual resize or add a second VM | Automatic or manual horizontal scaling |
| Deployment | SSH in, run commands, or redeploy manually | Rolling update via new template version |
| Configuration consistency | Can drift over time from manual changes | Enforced by immutable instance template |
| Best for | One-off tasks, dev/test, single-purpose VMs | Production services, repeatable VM fleets |
GCP also has unmanaged instance groups, which let you group arbitrary existing VMs together for load balancing purposes. They have no autohealing, no rolling updates, and no autoscaling. They exist mainly for legacy use cases where VMs cannot all be identical. For new infrastructure, use managed instance groups.
Common beginner mistakes
Not attaching a health check. Without a health check, autohealing only triggers on hypervisor-level crashes. A deadlocked application that is still running at the OS level will not be replaced. Always attach an HTTP health check for application-level autohealing.
Using a zonal MIG for production workloads. A zone outage takes down your entire service. Use a regional MIG to spread VMs across multiple zones. The protection is automatic once the group is regional.
Manually modifying VMs inside a MIG. The MIG does not track changes made inside a running VM, so nothing flags the drift, and the changes vanish the next time the VM is recreated from the instance template. Treat MIG VMs as immutable. All configuration changes go through a new template version and a rolling update.
Danger: SSHing into a MIG VM and changing configuration directly is one of the most common ways teams get into trouble. The MIG will overwrite your changes on the next replacement. Worse, it can cause inconsistency between VMs in the same group until they are all recycled.
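If a VM has already been changed by hand, one way to bring it back in line is to have the group recreate it from the template. This is a sketch only. The instance name web-mig-abcd is hypothetical, and real names can be found with the list-instances command shown earlier.

```shell
# Recreate a hand-modified VM so it matches the group's instance template again.
# "web-mig-abcd" is a hypothetical instance name.
gcloud compute instance-groups managed recreate-instances web-mig \
  --instances=web-mig-abcd \
  --zone=us-central1-a
```

Recreating a VM discards anything stored on its boot disk, so copy off any logs you need first.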
Setting the initial delay too short. If your application takes 60 seconds to fully start and the initial delay is 10 seconds, the health check fires before the app is ready, the VM is marked unhealthy, and the MIG replaces it in a loop. Set the initial delay to comfortably exceed your application’s actual startup time.
Confusing the MIG with the load balancer or autoscaler. The MIG, load balancer, health check, and autoscaler are separate GCP resources configured independently. The MIG manages the VM fleet. The load balancer distributes traffic. The autoscaler adjusts group size. Health checks are used by both the MIG (for autohealing) and the load balancer (for routing decisions).
Summary
- A managed instance group runs a fleet of identical VMs from a single instance template and maintains desired state automatically
- Regional MIGs spread VMs across multiple zones — use them for production to survive zone outages
- Application-level autohealing requires an attached health check, typically HTTP; set an initial delay that covers your application’s full startup time
- Rolling updates replace VMs in batches — control speed and safety trade-offs with max-unavailable and max-surge
- Never manually modify VMs inside a MIG — always apply changes through a new template version and rolling update
- MIGs are best for stateless, repeatable VM fleets; for containers, Cloud Run or GKE are usually a better fit
Frequently asked questions
What is a managed instance group in GCP?
A managed instance group (MIG) is a set of identical Compute Engine VMs created from a single instance template and managed as a unit. The group maintains a desired VM count, automatically replaces unhealthy VMs, supports rolling updates to deploy new versions without downtime, and integrates with load balancers and autoscaling. MIGs are the standard pattern for running stateless services at scale on Compute Engine.
What is the difference between a zonal and regional managed instance group?
A zonal MIG runs all VMs in a single zone. A regional MIG spreads VMs across multiple zones in a region. If one zone goes down, VMs in other zones continue serving traffic. Use regional MIGs for production workloads. Use zonal MIGs for dev/test or situations where zone redundancy is not required.
Do managed instance groups require a load balancer?
No. A MIG can run without a load balancer. You can use MIGs for background workers, batch processing, or any VM fleet that does not receive HTTP traffic. A load balancer is optional, but MIGs integrate directly with GCP load balancers when you need to distribute traffic across VMs.
How does autohealing work in a MIG?
Autohealing requires an attached HTTP or TCP health check. If a VM fails the check — for example, stops responding on the expected port — the MIG deletes it and creates a replacement from the instance template. Without a health check, the MIG only replaces VMs that crash at the hypervisor level. A deadlocked application that is still running at the OS level will not be replaced unless you use an HTTP health check.
Are managed instance groups suitable for stateful workloads?
Standard MIGs are designed for stateless workloads where any VM can be replaced without data loss. GCP also offers stateful MIGs with stateful policies that preserve specific disks and metadata across replacements. For most stateful applications, managed database services like Cloud SQL or Spanner are a better fit than running stateful services inside a MIG.