GCP Managed Instance Group Autoscaling Explained for Beginners
Autoscaling lets your infrastructure grow when traffic spikes and shrink when it drops, so you pay for what you use rather than what you might need. This page covers how autoscaling works in Compute Engine managed instance groups, how to configure the most common scaling signals, and how to avoid the mistakes that cause over-provisioning, flapping, or failed scale-out.
Simple explanation
A managed instance group is a collection of identical VMs created from the same instance template. Every VM in the group runs the same image, the same startup script, and the same configuration. This uniformity is what makes autoscaling possible.
The autoscaler is a GCP service that watches a signal you choose, usually CPU utilisation, request rate, or queue depth, and adjusts the number of VMs in the group to keep that signal close to a target you set. When demand rises, it creates more VMs from the template. When demand falls, it removes VMs gradually.
The result: your service handles traffic spikes without you manually provisioning capacity, and you stop paying for idle VMs during quiet periods. Autoscaling works best when your VMs are interchangeable and stateless. Any VM can handle any request, and no critical data lives on the VM’s local disk.
Think of the autoscaler as a thermostat for your server fleet. You set a target temperature (60% CPU) and a range (min 2 VMs, max 10 VMs). When the room gets too hot, it adds VMs to cool things down. When it is cold and quiet at 3am, it turns some VMs off to save money. You set the rules once and the thermostat handles it from there.
Why autoscaling matters
Without autoscaling, you face two bad options: provision for peak traffic (wasteful) or provision for average traffic (risky). A fixed fleet sized for average load falls over during spikes. A fixed fleet sized for peak load wastes money every night, weekend, and off-season.
Autoscaling gives you a third option: size the group for the current moment. This matters for cost, but also for reliability. A group that can grow under load is far more resilient than one that is permanently undersized or one where you need to remember to manually scale before every peak period.
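The cost gap can be estimated with back-of-envelope arithmetic. The sketch below compares VM-hours per day for a fixed fleet sized for peak against an autoscaled group. The 10-VM peak, 2-VM floor, and 10 busy hours are illustrative assumptions, not measurements, and the function name is hypothetical.

```shell
# vm_hours_per_day BUSY_VMS BUSY_HOURS QUIET_VMS
# Prints VM-hours for one day: BUSY_VMS during the busy hours, QUIET_VMS otherwise.
vm_hours_per_day() (
  busy_vms=$1 busy_hours=$2 quiet_vms=$3
  echo $(( busy_vms * busy_hours + quiet_vms * (24 - busy_hours) ))
)

fixed=$(vm_hours_per_day 10 24 10)   # fixed fleet sized for peak, all day
scaled=$(vm_hours_per_day 10 10 2)   # 10 VMs for 10 busy hours, 2 VMs otherwise
echo "fixed=$fixed autoscaled=$scaled"
```

Under these assumptions the autoscaled group uses roughly half the VM-hours of the peak-sized fleet. Real savings depend on how quickly the group actually scales in.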
How autoscaling works
The autoscaler runs a continuous decision loop. Every few seconds it evaluates the current value of your chosen signal against the target you set, calculates how many VMs would bring the signal back to target, then compares that recommendation against your minimum and maximum replica counts. If the numbers differ, it sends a resize request to the MIG.
Three numbers define the autoscaler’s operating range:
Min replicas — the group never shrinks below this count. Set to at least 2 in production so the group survives a rolling update or VM replacement without hitting zero.
Max replicas — the group never grows beyond this count. Set based on downstream capacity (database connections, external API rate limits), not just expected peak traffic.
Target signal — the metric value the autoscaler tries to maintain. For CPU, this is a fraction like 0.60 (60% average CPU). For request rate, it is requests per second per VM.
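The decision loop can be approximated with a few lines of integer arithmetic. This is a simplified model, not the autoscaler's actual implementation: it ignores stabilisation, cool-down, and scale-in controls, and the function name is hypothetical. Utilisation is passed as whole percentages to keep the shell arithmetic integer-only.

```shell
# recommended_size CURRENT_VMS OBSERVED_PCT TARGET_PCT MIN MAX
# Prints the VM count that would bring average utilisation back to the target,
# clamped to the min/max replica bounds.
recommended_size() (
  current=$1 observed=$2 target=$3 min=$4 max=$5
  # Total load is current * observed; divide by the target and round up.
  needed=$(( (current * observed + target - 1) / target ))
  if [ "$needed" -lt "$min" ]; then needed=$min; fi
  if [ "$needed" -gt "$max" ]; then needed=$max; fi
  echo "$needed"
)

recommended_size 4 90 60 2 10   # 4 VMs at 90% CPU with a 60% target
```

With 4 VMs running at 90% against a 60% target, the model asks for 6 VMs. A quiet group is clamped up to the minimum, and a heavily overloaded one is clamped down to the maximum.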
Scale-out: adding VMs
When the autoscaler decides to add VMs, it creates them from the instance template. New VMs run your startup script, then the health check must pass before the VM is considered ready. Scale-out takes as long as your startup script plus the initial health check delay. It is not instant.
The cool-down period (also called the initialisation period) tells the autoscaler how long a new VM takes to start up. During that window the autoscaler ignores usage data from the new VM, so the boot-time CPU spike does not count as load. Set it to at least your application’s startup time. If you set it too short, the autoscaler treats the boot-time CPU spike on new VMs as real demand and adds another wave before the first batch is serving, causing a runaway scale-out cascade.
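One way to pick a cool-down value is to measure a few real startups and add a safety margin to the slowest one. The sketch below does that; the helper name, the 20% margin, and the sample timings are all assumptions for illustration.

```shell
# suggest_cool_down MARGIN_PCT STARTUP_SEC...
# Prints the slowest measured startup time plus a percentage margin, in seconds.
suggest_cool_down() (
  margin=$1
  shift
  slowest=0
  for s in "$@"; do
    if [ "$s" -gt "$slowest" ]; then slowest=$s; fi
  done
  echo $(( slowest + slowest * margin / 100 ))
)

suggest_cool_down 20 75 82 110 96   # four measured startups, 20% margin
```

Here the slowest startup is 110 seconds, so the suggested cool-down is 132 seconds. The result can be passed to --cool-down-period.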
Scale-in: removing VMs
Scale-in is deliberately slower than scale-out to avoid disrupting active connections. The autoscaler waits for sustained low utilisation before removing VMs: it bases the decision on the peak load seen over roughly the last 10 minutes rather than on the latest reading. You can further limit scale-in speed with a scale-in control that caps how many VMs are removed per time window.
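The effect of a scale-in control can be sketched as a floor under the metric-based recommendation. This is a simplified model with a hypothetical function name: it assumes the group may not shrink more than the configured number of VMs below the peak size seen in the trailing time window.

```shell
# limited_scale_in PEAK_SIZE MAX_SCALED_IN RECOMMENDED
# Prints the size the group may shrink to: never more than MAX_SCALED_IN
# below the peak size seen in the trailing time window.
limited_scale_in() (
  peak=$1 max_scaled_in=$2 recommended=$3
  floor=$(( peak - max_scaled_in ))
  if [ "$recommended" -lt "$floor" ]; then
    echo "$floor"
  else
    echo "$recommended"
  fi
)

limited_scale_in 10 1 4   # peak of 10 VMs, cap of 1 VM per window, metric wants 4
```

Even though the metric recommends 4 VMs, the group only drops to 9 in this window. It reaches 4 gradually if the low load persists.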
Quotas are real ceilings. The autoscaler will try to create VMs up to your max replicas, but if your regional CPU quota is lower, scale-out events will partially fail. Check your GCP quotas before setting a high max, and request increases before you actually need them.
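A quick headroom check shows whether a max replicas value is reachable. The sketch below assumes you have already read the regional CPUS quota limit and current usage (for example from the quotas listed by gcloud compute regions describe us-central1). The function name and the figures are hypothetical.

```shell
# quota_headroom_vms QUOTA_LIMIT_VCPUS QUOTA_USED_VCPUS VCPUS_PER_VM
# Prints how many additional VMs of this size fit under the remaining quota.
quota_headroom_vms() (
  limit=$1 used=$2 vcpus_per_vm=$3
  echo $(( (limit - used) / vcpus_per_vm ))
)

quota_headroom_vms 24 8 4   # 24 vCPU quota, 8 in use, 4 vCPUs per VM
```

With 16 vCPUs of quota left and 4 vCPUs per VM, only 4 more VMs can be created. A group already at 2 VMs would stall at 6, however high max replicas is set.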
Prerequisites
You need a managed instance group before you can configure autoscaling. If you have not set one up yet, start with the managed instance groups guide, which covers creating a MIG, setting up autohealing, and running rolling updates. The autoscaler is then configured on top of the existing MIG.
When to use autoscaling
Web services with variable traffic. A service that handles 50 requests per second on Tuesday morning and 500 on Friday afternoon is a natural fit. The group scales out as the peak builds and scales back in afterward.
Internal tools with working-hour demand. Services that are busy 9am to 6pm and idle overnight benefit from scheduled scaling to pre-warm capacity at the start of the day, with metric-based scaling on top for unexpected demand.
Queue workers processing a backlog. Scale worker VMs based on Pub/Sub backlog depth. When messages accumulate, add workers. When the queue drains, remove them. This is one of the cleanest autoscaling patterns because the signal directly represents work to do.
Batch processing with variable job size. Jobs that arrive irregularly can spin up a large group, process quickly, and scale back to min replicas between runs.
Cost optimisation on fault-tolerant workloads. Combine autoscaling with Spot VMs to reduce compute cost significantly. Spot VMs can be reclaimed by GCP with short notice, so they only suit workloads that can tolerate interruption.
When not to use autoscaling
Do not autoscale stateful VMs without offloading state first. Scale-in removes VMs. When a VM is removed, its local disk is destroyed. Session data, uploaded files, or any state stored locally will be permanently lost. Move state to Cloud Storage, Cloud SQL, or Firestore before enabling autoscaling.
Applications with long or unpredictable startup times. If a VM takes 10 minutes to start and be ready to serve, autoscaling reacts too slowly to be useful for traffic spikes. Fix the startup time first.
Systems bottlenecked by a fixed downstream dependency. If your database handles 100 concurrent connections and you scale VMs to 200, the extra VMs will cause connection pool exhaustion without improving throughput. Autoscaling helps with compute capacity, not with bottlenecks elsewhere in the stack.
Small, static workloads. A service with steady, predictable load at a known VM count does not benefit from autoscaling. A fixed-size group is simpler to reason about and carries no risk of unexpected scale-out costs.
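For the downstream-dependency case above, a safe ceiling for max replicas can be derived from the dependency's own limit. This is a minimal sketch with a hypothetical function name. It assumes each VM holds a fixed-size connection pool and that some connections are reserved for administration and migrations.

```shell
# max_replicas_for_connections DB_CONNECTION_LIMIT RESERVED POOL_SIZE_PER_VM
# Prints the largest VM count whose combined pools fit under the database limit.
max_replicas_for_connections() (
  db_limit=$1 reserved=$2 pool_per_vm=$3
  echo $(( (db_limit - reserved) / pool_per_vm ))
)

max_replicas_for_connections 100 10 5   # 100 connections, 10 reserved, 5 per VM
```

With these assumed numbers the group should be capped at 18 VMs, regardless of how much traffic the compute tier could otherwise absorb.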
The four autoscaling signals
CPU utilisation
The simplest signal. The autoscaler tracks average CPU across all VMs and adds or removes VMs to keep that average near your target. Works well for compute-bound workloads where CPU closely tracks actual load. Target 60 to 70% rather than 100%. At 100% target, the service is already degraded before new VMs arrive.
HTTP load balancing serving capacity
Scales based on requests per second per VM, or as a fraction of the VM’s defined serving capacity. Best for web services sitting behind a GCP HTTP(S) load balancer. This signal tracks actual request pressure rather than CPU, which is more accurate for I/O-heavy services where CPU stays low even under heavy load.
Cloud Monitoring metrics
Scale on any metric available in Cloud Monitoring, including custom metrics you publish yourself. The most common use is scaling queue workers based on Pub/Sub subscription backlog depth. You can also use this for memory-based scaling, database queue depth, or any application-level metric that reflects real work waiting to be done.
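Queue-based scaling reduces to dividing the backlog by the amount of work one VM is expected to handle. The sketch below is a simplified model, not the autoscaler's code. The function name is hypothetical, and the per-VM figure of 100 messages and the 1 to 20 VM bounds are assumptions.

```shell
# workers_for_backlog BACKLOG PER_VM MIN MAX
# Prints the worker count needed so each VM handles about PER_VM messages,
# clamped to the min/max replica bounds.
workers_for_backlog() (
  backlog=$1 per_vm=$2 min=$3 max=$4
  want=$(( (backlog + per_vm - 1) / per_vm ))   # round up
  if [ "$want" -lt "$min" ]; then want=$min; fi
  if [ "$want" -gt "$max" ]; then want=$max; fi
  echo "$want"
)

workers_for_backlog 1250 100 1 20   # 1250 undelivered messages
```

A backlog of 1250 messages asks for 13 workers. An empty queue falls back to the minimum of 1, and a very large backlog is capped at 20.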
Schedules
Set a minimum VM count for a specific time window, defined as a cron expression. The autoscaler can still scale above the scheduled minimum based on live metrics. The schedule just prevents scaling below it during the window. Useful for pre-warming capacity before a known traffic peak rather than waiting for the signal to climb.
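Schedule durations are given in seconds, which makes it easy to misjudge when a window ends. The helper below is a hypothetical convenience, not part of gcloud. It converts a start time and a duration into the end time on a 24-hour clock, so you can check that windows cover the hours you expect.

```shell
# schedule_end START_HOUR START_MIN DURATION_SEC
# Prints the HH:MM at which the window ends, wrapping past midnight.
# Pass hours and minutes without leading zeros (7, not 07).
schedule_end() (
  start_hour=$1 start_min=$2 duration_sec=$3
  end_min=$(( (start_hour * 60 + start_min + duration_sec / 60) % 1440 ))
  printf '%02d:%02d\n' $(( end_min / 60 )) $(( end_min % 60 ))
)

schedule_end 7 0 36000    # window starting at 07:00 for 36000 seconds
schedule_end 19 0 43200   # window starting at 19:00 for 43200 seconds
```

The first window ends at 17:00 and the second at 07:00 the next day. Between 17:00 and 19:00 neither schedule applies, so the floor falls back to the autoscaler's configured minimum.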
Not sure which signal to pick? Start with CPU if your service is compute-heavy. Use the load balancer signal if your service is I/O-heavy and CPU stays low under load. Use a Cloud Monitoring metric for queue workers. Add a schedule on top if you have predictable daily traffic patterns. You can combine signals, and the autoscaler will use whichever recommends the most VMs at any moment.
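Combining signals amounts to taking the largest recommendation. The sketch below illustrates that rule with a hypothetical function and made-up per-signal numbers.

```shell
# combined_size RECOMMENDATION...
# Prints the largest of the per-signal recommendations.
combined_size() (
  largest=0
  for r in "$@"; do
    if [ "$r" -gt "$largest" ]; then largest=$r; fi
  done
  echo "$largest"
)

combined_size 4 6 10   # e.g. CPU suggests 4, queue depth suggests 6, schedule floor is 10
```

Here the schedule wins and the group is sized at 10 VMs. If the CPU signal later recommends 12, that becomes the new size, subject to max replicas.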
Implementation examples
Basic CPU autoscaling
Configure autoscaling on an existing MIG. The group will scale between 2 and 10 VMs, targeting 60% average CPU. The 90-second cool-down prevents over-scaling during the startup period after a scale-out event.
gcloud compute instance-groups managed set-autoscaling web-mig \
  --zone=us-central1-a \
  --min-num-replicas=2 \
  --max-num-replicas=10 \
  --target-cpu-utilization=0.60 \
  --cool-down-period=90
# Check current autoscaler status
gcloud compute instance-groups managed describe web-mig \
  --zone=us-central1-a \
  --format="value(autoscaler.status)"
Cloud Monitoring metric: Pub/Sub backlog
Scale a worker group based on how many undelivered messages are in a Pub/Sub subscription. A target of 100 means the autoscaler tries to keep roughly 100 messages per VM. As the backlog grows, more VMs are added. As it drains, VMs are removed.
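# Pub/Sub backlog is a per-group metric, so it needs a filter and a
# per-VM assignment. worker-sub is a placeholder subscription ID.
gcloud compute instance-groups managed set-autoscaling worker-mig \
  --zone=us-central1-a \
  --min-num-replicas=1 \
  --max-num-replicas=20 \
  --update-stackdriver-metric=pubsub.googleapis.com/subscription/num_undelivered_messages \
  --stackdriver-metric-filter="resource.type = pubsub_subscription AND resource.labels.subscription_id = worker-sub" \
  --stackdriver-metric-single-instance-assignment=100
Scheduled scaling for predictable traffic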
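Pre-warm to 10 VMs at 7am every weekday (UTC) ahead of expected morning traffic. The group stays at or above this minimum for 10 hours (36,000 seconds), after which the floor drops back to the autoscaler's configured minimum. Metric signals can still scale the group above the floor at any time. The second schedule sets an explicit floor for the overnight window.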
Scheduled scaling is like a restaurant calling in extra staff before the lunch rush instead of waiting until the queue is out the door. You know demand is coming, so you pre-warm capacity. The autoscaler then handles unexpected surges on top of that baseline — like calling in emergency staff if a large unexpected party arrives.
# Keep at least 10 VMs for 10 hours from 7am UTC every weekday
gcloud compute instance-groups managed update-autoscaling web-mig \
  --zone=us-central1-a \
  --set-schedule=morning-scale-up \
  --schedule-cron="0 7 * * 1-5" \
  --schedule-duration-sec=36000 \
  --schedule-min-required-replicas=10 \
  --schedule-time-zone="UTC"
# Set a lower floor of 2 VMs for the 12-hour overnight window
gcloud compute instance-groups managed update-autoscaling web-mig \
  --zone=us-central1-a \
  --set-schedule=evening-scale-down \
  --schedule-cron="0 19 * * 1-5" \
  --schedule-duration-sec=43200 \
  --schedule-min-required-replicas=2 \
  --schedule-time-zone="UTC"
Scale-in control
Limit how quickly the group shrinks to avoid disrupting active connections. This setting caps scale-in at one VM per five minutes, regardless of what the metric signal recommends.
gcloud compute instance-groups managed update-autoscaling web-mig \
  --zone=us-central1-a \
  --scale-in-control=max-scaled-in-replicas=1,time-window=300
Autoscaling vs autohealing
These two features are often confused because they are both configured on a MIG and both involve VM lifecycle decisions. They solve different problems.
Autoscaling changes how many VMs are in the group. It responds to load signals and works within your min/max bounds.
Autohealing replaces individual VMs that fail a health check, regardless of the group’s overall size. If one VM crashes or becomes unresponsive, autohealing deletes it and creates a replacement without changing the total group size.
A production MIG typically needs both. Autoscaling handles demand. Autohealing handles failures within that capacity. Without autohealing, unhealthy VMs stay in the group and receive traffic. Without autoscaling, the group cannot grow under load.
Set the --initial-delay for your autohealing health check to be longer than your application’s startup time. If the delay is too short, autohealing will see a freshly created VM fail its health check during boot and delete it, which triggers another replacement that gets deleted too. You end up in a boot loop. Set the initial delay conservatively and tighten it only after measuring real startup durations. See the MIG guide for full health check configuration.
MIG autoscaling vs Cloud Run scaling
Both MIG autoscaling and Cloud Run scale workloads based on demand, but they operate at different levels of abstraction and suit different use cases.
MIG autoscaling works at the VM level. You control the OS, the runtime, the startup script, disk configuration, and machine type. Scaling happens in minutes. You manage more, but you get more control. Useful for workloads that need specific OS dependencies, GPU access, or persistent disk configuration.
Cloud Run scaling works at the container level. You package your app as a container and Cloud Run handles everything else. Scaling happens in seconds. You do not manage VMs, OS patches, or startup scripts. See Cloud Run scaling behaviour for how it works.
If your workload is a containerised, stateless HTTP service, Cloud Run is usually simpler and faster to scale. If your workload needs VM-level control, such as specific machine types, GPU, Windows OS, persistent disks, or a long-lived process, MIG autoscaling is the right tool. See choosing between Cloud Run, GKE, and Compute Engine VMs for a broader comparison.
Common mistakes
Setting max replicas higher than your quota allows. The autoscaler will try to reach the maximum. If your regional CPU quota is lower, scale-out events partially fail and the group ends up undersized under load. Check your GCP quotas and request increases before you need them.
Setting the cool-down period shorter than startup time. If your app takes 120 seconds to start and the cool-down is 30 seconds, the autoscaler sees high CPU from booting VMs and adds more before the first batch is ready, causing a scale-out cascade. Set cool-down to at least your startup time.
Targeting 100% CPU utilisation. At a 100% target, the autoscaler only adds VMs after every existing VM is saturated. Requests are already degraded by then. Target 60 to 70% to leave headroom while new VMs are starting.
Autoscaling stateful VMs without offloading state. Scale-in removes VMs and destroys their local disks. If your VMs hold session data or uploaded files locally, scale-in loses it permanently. Move state to Cloud Storage, Cloud SQL, or Firestore before enabling autoscaling.
Ignoring downstream bottlenecks. Scaling VMs to 100 instances does not help if your database has a 50-connection limit. The extra VMs will exhaust the connection pool and make things worse. Know the capacity limits of every dependency before setting your max replicas.
Forgetting the health check initial delay. A health check with an initial delay shorter than VM startup time causes autohealing to delete new VMs during boot. Set the initial delay conservatively and tighten it only after measuring real startup times.
Choosing the wrong scaling signal. CPU is not always the right signal. An I/O-bound service may handle many requests at low CPU. A queue worker should scale on queue depth, not CPU. Match the signal to what actually represents work for your workload.
Best practices for production
Keep instance templates immutable. Never modify a template in place once the MIG is using it. Create a new template version and do a rolling update. This preserves the ability to roll back cleanly.
Use at least 2 min replicas for production services. A single-VM group goes to zero during a rolling update or replacement, causing downtime. Two instances keep the service alive while one is replaced.
Validate quotas before raising max replicas. Request quota increases in advance if you plan to scale large. Quota requests can take hours to process and you do not want to hit the ceiling during a traffic event.
Combine autoscaling with a load balancer. The autoscaler changes VM count, but traffic only reaches healthy VMs if a load balancer is distributing it. An HTTP load balancer also unlocks the load balancing serving capacity signal.
Monitor scale events and startup times. Use Cloud Monitoring to track autoscaler decisions, VM startup durations, and instance counts over time. Alert on unexpected scale-out events that could indicate runaway scaling.
Test scaling under realistic load. Run load tests before production traffic hits. Verify that scale-out reaches the expected instance count, that cool-down prevents cascades, and that startup time stays within your cool-down budget.
Keep apps stateless where possible. Any VM should be able to handle any request. Store sessions in a cache or database, not in VM memory or local disk. Stateless VMs are safely replaceable, which is the foundation autoscaling relies on.
Protect downstream dependencies. Set max replicas conservatively if your database, external API, or messaging system has a hard capacity ceiling. Coordinate with the owners of those systems before scaling large.
Summary
- Autoscaling adjusts how many VMs run in a managed instance group based on a signal you define: CPU, request rate, queue depth, or a schedule.
- Set min replicas (at least 2 for production), max replicas (based on downstream capacity), and a target signal value.
- Set the cool-down period to at least your VM startup time to prevent scale-out cascades during boot.
- Use scale-in control to limit how quickly the group shrinks and protect active connections.
- Autoscaling and autohealing are complementary. Autoscaling manages quantity; autohealing manages individual VM health.
- Check quotas and downstream limits before setting a high max. The autoscaler will try to reach it.
- Autoscaling works best with stateless, interchangeable VMs. Move state to managed storage before enabling it.
Frequently asked questions
What metrics can trigger autoscaling in a managed instance group?
Four signal types: CPU utilisation (scale when average CPU across the group exceeds a target), HTTP load balancing serving capacity (requests per second per VM), Cloud Monitoring metrics (any metric including custom ones, commonly Pub/Sub subscription backlog), and schedules (hold a minimum group size during a set time window). You can combine multiple signals. When signals conflict, the autoscaler picks whichever recommends the most VMs.
What is a good CPU target for Compute Engine autoscaling?
60 to 70 percent is a common starting point. Targeting 100% means every existing VM is fully saturated before the autoscaler adds more. At that point requests are already degrading. A target of 60% leaves room for new VMs to start and begin serving before the existing ones hit their limit.
What is the difference between autoscaling and autohealing?
Autoscaling changes how many VMs are in the group based on load. Autohealing replaces individual VMs that fail a health check, regardless of group size. Production MIGs typically need both: autoscaling handles demand, autohealing handles failures within that capacity.
How fast does a MIG autoscaler scale out?
Scale-out speed depends mainly on your VM startup time. The autoscaler makes a decision within seconds, but new VMs typically become ready to serve in 1 to 3 minutes once the startup script finishes and the health check passes. Set the cool-down period to at least your app startup time so the autoscaler does not add another wave of VMs before the first batch is serving.
Can autoscaling work with stateful workloads?
With care. Scale-in removes VMs and destroys their local disk state. If your VMs store critical session data or files locally, scale-in will lose it. For most stateful workloads, move state to Cloud SQL, Firestore, or Cloud Storage so each VM is replaceable. Autoscaling then works safely. GCP does offer stateful MIGs, but a group with stateful configuration does not support autoscaling, so they are better suited to fixed-size groups.