Cloud Run Scaling Explained: Cold Starts, Min Instances, Max Instances, and Concurrency

Cloud Run scales automatically. But without understanding how, you end up with cold starts on customer-facing requests, database connection pools exhausted under load, or surprising bills. This guide explains how Cloud Run scaling actually works, what each control does, and how to choose the right settings for your specific workload.

Simple explanation

Cloud Run runs your container image and routes HTTP traffic to it. When more requests arrive than your current instances can handle, Cloud Run starts new instances. When traffic drops and instances sit idle, they are stopped. With the default configuration, a service that receives no traffic eventually has zero running instances. That is scale-to-zero.

The three numbers you control are:

Minimum instances: the floor. How many instances stay running even at zero traffic.
Maximum instances: the ceiling. How many instances Cloud Run is allowed to start.
Concurrency: how many simultaneous requests a single instance handles before Cloud Run starts another one.

Analogy

Think of Cloud Run instances as checkout tills at a supermarket. The store opens more tills as queues grow and closes idle ones when the store is quiet. Concurrency is how many customers each till can serve at once. Minimum instances is the number of tills that stay staffed even when no customers are present. Opening a closed till takes a moment before the first customer can be served. That is a cold start.

How Cloud Run scaling works

When a request arrives at your Cloud Run service, Google’s load balancer sends it to an available instance. Cloud Run tracks how many requests each instance is currently handling. When that number approaches the concurrency limit, Cloud Run starts a new instance rather than queue the request against an already-busy one.
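As a rough mental model of the paragraph above, the instance count is the number of in-flight requests divided by the concurrency limit, rounded up and clamped between the minimum and maximum. The sketch below is a planning approximation only, not the real autoscaler, which also reacts to CPU utilisation and starts instances before the limit is actually hit. The function name and sample numbers are illustrative.

```shell
# estimate_instances IN_FLIGHT CONCURRENCY MIN MAX
# Rough planning estimate; NOT the actual Cloud Run autoscaling algorithm.
estimate_instances() {
  in_flight=$1; concurrency=$2; min=$3; max=$4
  # Ceiling division: instances needed to hold all in-flight requests.
  needed=$(( (in_flight + concurrency - 1) / concurrency ))
  # Clamp to the configured floor (min-instances) and ceiling (max-instances).
  if [ "$needed" -lt "$min" ]; then needed=$min; fi
  if [ "$needed" -gt "$max" ]; then needed=$max; fi
  echo "$needed"
}

estimate_instances 200 80 1 50     # prints 3
estimate_instances 0 80 1 50       # prints 1 (the warm floor)
estimate_instances 10000 80 1 50   # prints 50 (capped at max-instances)
```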

New instances go through a cold start: Cloud Run pulls any uncached container image layers from Artifact Registry, starts the container process, and waits for your application to bind to the PORT environment variable and signal readiness. Only then can the instance accept traffic. This window (image pull to ready) is the cold start latency added to the request that triggered the new instance.

When traffic drops and instances fall below their concurrency threshold, Cloud Run marks them idle. After a cooldown window, idle instances are stopped. If min-instances is 0 and the service fully drains, all instances are eventually terminated and the service scales to zero.

The next request to arrive after scale-to-zero waits for a full cold start before getting a response. This is the fundamental trade-off: scale-to-zero saves money at idle, but the first request pays a latency penalty.

Note

Cold start duration varies significantly by workload. A Go binary in a distroless image may start in under 300ms. A Python framework that connects to a database on startup can take 3 to 5 seconds. Check the run.googleapis.com/container/startup_latencies metric in Cloud Monitoring to measure yours before deciding on min-instances.

Key scaling controls

Minimum instances

Minimum instances sets the number of container instances that stay running even when there is no traffic. The default is 0, which enables scale-to-zero.

Setting --min-instances=1 keeps one warm instance ready at all times. The first request after any idle period hits a ready container instead of waiting for startup.

Tip

For most user-facing services, setting --min-instances=1 is the single most impactful scaling change you can make. In the default CPU-during-requests mode, a warm 256Mi instance is billed at a reduced idle rate while it serves no requests, typically a few dollars per month. That cost eliminates the “dead service” cold start entirely.

Setting --min-instances=2 or higher also helps during rolling deployments. When you push a new revision, Cloud Run replaces instances gradually. With only one minimum instance, there is a brief window where the old instance is stopping and the new one is not yet warm. Two minimum instances removes that gap. See CI/CD pipelines for Cloud Run for how deployment strategies interact with instance configuration.

Min instances	Cold start risk	Best for
0 (default)	Yes, on first request after idle	Internal tools, webhooks, batch triggers, event handlers where callers tolerate latency
1	No, baseline always warm	User-facing APIs and websites where cold start latency is visible to end users
2+	No	Production APIs with SLOs; services needing zero-downtime during rolling deployments

Maximum instances

Maximum instances caps how many container instances Cloud Run is allowed to run simultaneously. Without a cap, Cloud Run can scale to hundreds of instances under an unexpected traffic spike. That is often not a billing problem first. It is a downstream problem.

Watch out

Leaving max-instances uncapped is the most common Cloud Run production mistake. A traffic spike can start hundreds of instances in seconds, exhausting database connection pools and causing cascading failures across your service before billing becomes a concern. Always set --max-instances explicitly.

The most common failure mode is database connection exhaustion. If each Cloud Run instance opens one database connection and your database accepts 100 connections, then instance 101 causes connection errors that cascade into 500 responses across your entire service. Set --max-instances based on what your database or upstream services can handle, not on an estimate of expected traffic.

When max-instances is reached and all instances are at concurrency capacity, Cloud Run queues incoming requests briefly. If the queue overflows or requests time out, Cloud Run returns HTTP 429. Design your clients to handle this gracefully.
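One common way for a client to handle 429 gracefully is to retry with exponential backoff. The sketch below assumes the wrapped command prints an HTTP status code on stdout; `retry_on_429`, its arguments, and the delays are hypothetical choices for illustration, not a Cloud Run feature.

```shell
# retry_on_429 MAX_ATTEMPTS BASE_DELAY_SECONDS COMMAND [ARGS...]
# COMMAND must print an HTTP status code on stdout.
retry_on_429() {
  max_attempts=$1; delay=$2; shift 2
  attempt=1
  while :; do
    status=$("$@")
    # Any status other than 429 (success or a different error) is final.
    if [ "$status" != "429" ]; then
      echo "$status"
      return 0
    fi
    # Give up once the attempt budget is spent.
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "$status"
      return 1
    fi
    sleep "$delay"
    delay=$(( delay * 2 ))        # exponential backoff
    attempt=$(( attempt + 1 ))
  done
}
```

For example, `retry_on_429 5 1 curl -s -o /dev/null -w '%{http_code}' "$SERVICE_URL"` (with `SERVICE_URL` as a placeholder for your own endpoint) would try up to five times, waiting 1, 2, 4, and 8 seconds between attempts. Adding random jitter to each delay is a common refinement so that many clients do not retry in lockstep.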

Concurrency

Concurrency is the maximum number of simultaneous HTTP requests a single container instance handles. The default is 80. The maximum is 1000.

Analogy

Think of concurrency like a waiter managing tables. An I/O-bound service is a waiter who spends most of their time waiting for the kitchen to deliver food. They can comfortably manage 20 tables at once because most of the work is just waiting. A CPU-bound service is a chef who is actively cooking. Adding more orders does not help: the kitchen is already at full capacity, and everything takes longer.

For I/O-bound services (APIs that spend most of their time waiting on database responses or external HTTP calls), high concurrency is efficient. One instance handles 80 requests simultaneously, most of which are just waiting on I/O. The instance is not CPU-saturated; it is managing many open connections at once.

For CPU-bound services such as image processing, PDF generation, video transcoding, and machine learning inference, high concurrency causes contention. Each request consumes real CPU. Lowering concurrency to 1 to 5 means each instance handles fewer simultaneous requests, giving each one more CPU share and more predictable latency.

Setting concurrency to 1 makes Cloud Run behave like traditional serverless functions: one request per instance. This maximises isolation but also maximises instance count and cold start exposure under load. Only use concurrency of 1 if your workload genuinely requires it, such as when request handling relies on process-global mutable state that is not safe to share between concurrent requests.

Tip

If you are unsure whether your service is I/O-bound or CPU-bound, start with the default concurrency of 80 and load test it. Watch CPU utilisation in Cloud Run monitoring under real load. If CPU stays below 60%, the default is working well. If it regularly spikes to 90%+, lower concurrency.

CPU allocation and startup behaviour

Cloud Run has two CPU allocation modes:

CPU allocated during requests only (default): CPU is available and billed only while a request is active. Idle instances have their CPU throttled. Most HTTP services have nothing to do between requests; this mode is both cost-efficient and correct for them.
CPU always allocated: CPU is available even between requests. Required if your application needs to run background processing, polling loops, or cache warmup logic outside of request handlers. Billed continuously while any instance is running.

The --cpu-boost flag is separate from CPU allocation mode. It temporarily allocates extra CPU during container startup to help your application initialise faster. This reduces cold start latency at the cost of slightly higher billing for those startup seconds. It is worth enabling for most services that have meaningful cold start times.

# Deploy with all key scaling parameters
gcloud run deploy my-service \
  --image=IMAGE \
  --region=us-central1 \
  --min-instances=1 \
  --max-instances=50 \
  --concurrency=80 \
  --cpu-boost

# Update scaling on a running service
gcloud run services update my-service \
  --region=us-central1 \
  --min-instances=2 \
  --max-instances=100

# Enable CPU always allocated for background processing
gcloud run services update my-service \
  --region=us-central1 \
  --no-cpu-throttling

# Set concurrency to 1 for CPU-bound or fully isolated workloads
gcloud run services update my-service \
  --region=us-central1 \
  --concurrency=1
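After changing any of these flags, it can help to confirm what the service is actually running with. The helper below is a sketch: `show_scaling` is a hypothetical wrapper name, and the grep pattern assumes the settings appear in the service YAML under their usual annotation and field names, which only show up once a value has been set.

```shell
# show_scaling SERVICE REGION
# Prints the scaling-related lines from the service's YAML description.
show_scaling() {
  gcloud run services describe "$1" --region="$2" --format=yaml \
    | grep -E 'minScale|maxScale|containerConcurrency|cpu-throttling|startup-cpu-boost'
}
```

Calling `show_scaling my-service us-central1` would list whichever of those fields are present. An empty result usually means the service is still on its defaults.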

How to choose Cloud Run scaling settings

The right configuration depends on what your service does and who calls it. These scenarios cover the most common cases.

Scenario	Min instances	Max instances	Concurrency
Internal tool or admin panel	0	5–10	80
Webhook or event handler	0	20–100	80
Public API (user-facing)	1–2	Constrained by DB connections	80
Latency-sensitive website	2+	50–200	80
CPU-heavy processing	0–1	10–20	1–5
I/O-heavy integration API	1	Based on upstream limits	80–200

Database connections

Safe max-instances = database connection limit ÷ connections opened per instance. If your database allows 200 connections and each Cloud Run instance opens 5, your ceiling is 40 instances. Exceed that and connection errors cascade into HTTP 500 responses across your entire service. A connection pooler such as PgBouncer can raise this ceiling without moving to a larger database tier.
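The division above is easy to script so the cap is derived rather than guessed. This is a minimal sketch: `safe_max_instances` is a hypothetical helper, and the optional third argument (connections held back for migrations or admin sessions) is an added assumption, not part of the formula above.

```shell
# safe_max_instances DB_CONNECTION_LIMIT CONNECTIONS_PER_INSTANCE [RESERVED]
# RESERVED defaults to 0; it is an illustrative safety margin.
safe_max_instances() {
  limit=$1; per_instance=$2; reserved=${3:-0}
  # Integer division rounds down, which is the safe direction for a ceiling.
  echo $(( (limit - reserved) / per_instance ))
}

safe_max_instances 200 5      # prints 40, matching the example above
safe_max_instances 200 5 20   # prints 36 with 20 connections held in reserve
```

The result can be passed straight to the flag, for example `--max-instances="$(safe_max_instances 200 5)"` on a `gcloud run services update` call.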

CPU-heavy workloads should start with concurrency of 1 and tune upward only after measuring. Use the CPU utilisation metrics visible in Cloud Run monitoring to check whether instances are CPU-saturated under real load before raising concurrency.

Spiky workloads such as triggered batch jobs, marketing email sends, and event-driven pipelines benefit from min-instances=0 for cost savings and a high enough max-instances to absorb bursts. Cold starts are acceptable here because callers are automated processes, not users waiting for a page load.

When you deploy a new Cloud Run service, start with the defaults and measure. Real traffic data is more reliable than estimates when tuning these settings.

When scaling settings actually matter

For many services, the defaults are fine. Scaling settings become critical in these situations:

Your service is user-facing. Cold starts adding one to three seconds to a page load are noticeable. Set min-instances to 1 or higher.
You have strict latency SLOs. P99 latency targets are broken by cold starts. Even a handful per hour will show up in tail latency metrics.
You share a downstream database. Without a max-instances cap, a traffic spike can instantly exhaust database connections. This is the most common scaling misconfiguration in production Cloud Run deployments.
Your traffic is highly spiky. A service that jumps from 0 to 500 requests per second triggers many simultaneous cold starts. CPU boost and small images reduce the impact, but spiky traffic always tests your scaling assumptions.
You are running CPU-bound workloads. The default concurrency of 80 saturates CPU on image processing or ML inference services. High contention manifests as rising latency under moderate load.
You are managing costs carefully. Minimum instances bill even at zero traffic. High max-instances allow short-lived cost spikes. See cost optimisation strategies for the broader picture of GCP spend management.
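To put numbers on the situations above, Little's law gives a quick estimate: in-flight requests are roughly the arrival rate multiplied by the average request latency. The sketch below applies it to the 500 requests per second case. The function names and the 200 ms latency are assumptions for illustration, and the result ignores autoscaler headroom.

```shell
# in_flight_requests RPS AVG_LATENCY_MS
# Little's law: concurrent requests = arrival rate x time in system.
in_flight_requests() {
  rps=$1; latency_ms=$2
  # Round up so the estimate errs on the high side.
  echo $(( (rps * latency_ms + 999) / 1000 ))
}

# instances_for_load RPS AVG_LATENCY_MS CONCURRENCY
instances_for_load() {
  in_flight=$(in_flight_requests "$1" "$2")
  # Ceiling division by the per-instance concurrency limit.
  echo $(( (in_flight + $3 - 1) / $3 ))
}

in_flight_requests 500 200      # prints 100
instances_for_load 500 200 80   # prints 2
instances_for_load 500 200 1    # prints 100
```

The last two lines show why concurrency matters for spiky traffic: the same burst needs 2 instances at the default concurrency but 100 cold starts at a concurrency of 1.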

Common mistakes

Leaving max-instances uncapped. Without a ceiling, Cloud Run scales to hundreds of instances under unexpected traffic. This typically causes downstream connection exhaustion before it causes a billing problem. Always set --max-instances explicitly, based on what your database or upstream services can sustain.
Scale-to-zero for a user-facing API. If your API serves real users and has quiet periods overnight or on weekends, the first morning request hits a cold start. One warm minimum instance costs a few dollars per month and makes response times consistent all day.
Concurrency of 1 for I/O-bound services. Concurrency of 1 creates a new instance for every simultaneous request. For a service spending most of its time waiting on a database, this multiplies instance count, cold starts, and cost with no benefit. The default of 80 handles I/O-bound work far more efficiently.
CPU always allocated for a standard request-response service. CPU always allocated is billed continuously per running instance. Most HTTP services have nothing to do between requests. Enable it only if you have actual background work to run between request handlers.
Assuming max-instances is purely a cost control. It is primarily a downstream protection mechanism. Set it based on what your database, cache, or upstream APIs can handle, not on an estimate of peak traffic volume.
Not measuring cold start duration before configuring min-instances. Cold start time varies significantly by image size, language runtime, and initialisation logic. Check the container/startup_latencies metric in Cloud Run monitoring before assuming your cold starts are acceptable or worth the cost of warming instances.
Over-warming instances. Setting min-instances=10 for a service that peaks at 5 simultaneous requests wastes money. Start at 1 or 2, observe actual concurrency under peak load in production, then tune upward only if needed.

Cloud Run scaling vs Cloud Functions and traditional VMs

	Cloud Run	Cloud Functions (1st gen)	VM autoscaling (MIGs)
Scale-to-zero	Yes (default)	Yes	No (minimum 1 VM)
Scale speed	Seconds	Seconds	Minutes
Concurrency per instance	Configurable (default 80, max 1000)	1 per invocation	Unlimited; you manage it in code
Cold start control	min-instances, cpu-boost	min-instances	Not applicable (VMs stay warm)
Max instance cap	Explicit flag	Explicit flag	Autoscaler max replicas config
Downstream connection risk	High without cap	High (one connection per invocation)	Predictable (fixed instance count)

Cloud Functions 1st gen uses a strict one-invocation-per-instance model, which means concurrency is effectively always 1. Under high throughput, this creates many more instances than Cloud Run would for the same request volume, with correspondingly more cold start exposure. Cloud Functions 2nd gen is built on Cloud Run and shares the same concurrency model. The main difference is the programming model and trigger types. See Event-Driven Patterns with Cloud Functions for when functions are the right choice over a full Cloud Run service.

VM-based autoscaling with Managed Instance Groups scales over minutes rather than seconds, cannot scale to zero, and requires you to manage the web server concurrency yourself. VMs suit workloads that need persistent local state, GPU access, custom kernel modules, or long-running daemon processes. For stateless HTTP workloads, Cloud Run’s faster scaling and zero-idle-cost model is almost always preferable. See Choosing Between Cloud Run, GKE, and VMs for a full breakdown.

Frequently asked questions

What causes cold starts in Cloud Run?

A cold start happens when a request arrives and no warm instance exists. Cloud Run pulls any uncached container image layers, starts the container process, and waits for your application to signal readiness. The main causes are: the service has scaled to zero (min-instances=0 with no recent traffic), a traffic spike that exceeds current instance capacity, or a first-ever deployment. Large images, interpreted languages with slow startup, and heavy initialisation logic on boot all make cold starts longer.

Does setting min-instances to 1 eliminate cold starts completely?

It eliminates the most common type: the "service woke up from zero" cold start. One warm minimum instance means the first request after any idle period hits a ready container instead of waiting for startup. But if traffic spikes and Cloud Run needs to start additional instances beyond that baseline, those new instances still go through cold starts. min-instances removes the idle-to-first-request problem, not the sudden-traffic-spike problem.

What concurrency setting should I use for Cloud Run?

The default of 80 is a sensible starting point for most I/O-bound services (APIs waiting on databases or external HTTP calls). For CPU-bound workloads like image processing or video encoding, lower concurrency (1 to 5) prevents CPU contention when multiple requests run simultaneously on the same instance. If you are unsure, start with the default and watch CPU utilisation metrics under load before lowering it.

What happens when Cloud Run reaches the max-instances limit?

When all instances are at their concurrency limit and the max-instances ceiling has been reached, incoming requests queue briefly. If the queue fills or requests exceed the timeout, Cloud Run returns HTTP 429 (Too Many Requests). Set max-instances based on what your downstream services can handle. A database accepting 100 connections is exhausted as soon as more than 100 Cloud Run instances each try to open one.

Does Cloud Run scale to zero automatically?

Yes. With min-instances set to 0 (the default), Cloud Run terminates all instances after a short idle window. There is no infrastructure cost when the service receives no traffic. The trade-off is cold start latency on the next request after that idle period. Set min-instances to at least 1 for user-facing services or anything with a latency SLO.

Is Cloud Run suitable for user-facing production services?

Yes, with the right configuration. Set min-instances to at least 1 to prevent the cold start on idle, set max-instances based on your downstream capacity, and enable CPU boost to reduce scale-out cold start duration. The default scale-to-zero behaviour is better suited to internal tools, webhooks, and event handlers where callers are not end users waiting for a page to load.

Last verified: 22 March 2026. Cloud services change frequently. Verify details against official documentation before making infrastructure decisions.