Cloud Run Cost Optimisation: Reduce CPU, Memory and Idle Spend

Cloud Run is one of the cheapest ways to run containers on GCP, when configured correctly. With request-based billing, you pay nothing when no requests are arriving. But a few common misconfigurations (wrong billing model, unnecessary minimum instances, or over-provisioned CPU) can quietly push costs far higher than expected.

This page explains every setting that affects your Cloud Run bill, when each setting matters, and how to tune them safely. Whether you are running a single low-traffic API or dozens of production services, the goal is the same: pay only for the compute you actually use.

If you are new to Cloud Run, start with the Cloud Run overview and deploying your first service before diving into cost tuning.

What to change first

Before reading the full page, check these five settings on every Cloud Run service:

Billing model: are you using request-based or instance-based billing? Most HTTP services should use request-based.
Minimum instances: is it set to 0? If not, do you have a measured reason for keeping instances warm?
Concurrency: is it still the default (80)? I/O-bound services can usually go much higher.
CPU and memory: are allocations sized to actual usage, or left at defaults?
Networking and egress: are you routing through a VPC connector or load balancer you do not need?

Simple explanation

Think of it like a taxi meter

Cloud Run billing works like a taxi that only charges you while you are moving. The meter ticks for three things: the number of rides (requests), the engine running time (CPU), and the size of the car (memory). When no passengers are in the car and you have no minimum instances, the meter stops completely and you pay nothing. That is the key advantage over VMs, which are like owning a car that costs money whether you drive it or not.

Cost goes up in four situations: instances stay active longer (slow requests, background processing, or minimum instances keeping containers alive), they use more CPU and memory than needed, more instances run simultaneously (low concurrency forcing extra scale-out), or responses leave Google’s network (internet egress). Every section below maps to one of these cost drivers.

How it works

Cloud Run offers two billing models. Google’s current terminology is request-based billing and instance-based billing. You may also see the older names “CPU allocated only during request processing” and “CPU always allocated” in documentation and CLI output. They refer to the same thing.

With request-based billing, CPU is allocated and billed only while a container is actively handling a request. Between requests, CPU is throttled and you pay nothing for it. This is the default for new Cloud Run services.

With instance-based billing, CPU is allocated continuously from the moment an instance starts until it shuts down, regardless of whether it is handling a request. This is required when your container needs to do work outside of a request, for example running a background thread that processes items from a queue.

	Request-based billing	Instance-based billing
CPU billed	Only during request handling	Entire instance lifetime
Idle cost	Zero (CPU throttled between requests)	Full CPU rate while instance is alive
Cheaper when	Traffic is bursty or low-volume	Instances are busy most of the time
Background work	Not possible (CPU throttled)	Supported
Typical use case	HTTP APIs, webhooks, websites	Services with background threads, long polling, persistent connections
CLI flag	`--cpu-throttling`	`--no-cpu-throttling`

With request-based billing you also pay per request ($0.40 per million after a free tier); instance-based billing has no per-request fee. Networking costs (internet egress, load balancer processing, VPC connector instances) are billed separately and can be significant. See the network egress costs guide for a full breakdown of what you pay for data leaving GCP.
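The two billing models reduce to simple formulas. The sketch below expresses them as shell functions; the function names are hypothetical and not part of any tool. No prices are built in: pass the current per-second and per-request rates for your region from the Cloud Run pricing page. It ignores free tiers, per-request rounding, instance start-up time, and networking.

```bash
#!/usr/bin/env bash
# Sketch of the two Cloud Run billing formulas. Prices are arguments,
# not built in: pass the current rates for your region and tier.

# Request-based: CPU and memory are billed only for busy instance-seconds
# (time with at least one request in flight), plus a per-request fee.
#   $1 busy instance-seconds   $2 vCPUs        $3 GiB of memory
#   $4 price per vCPU-second   $5 price per GiB-second
#   $6 billable requests       $7 price per million requests
request_based_cost() {
  awk -v s="$1" -v cpu="$2" -v mem="$3" -v pc="$4" -v pm="$5" \
      -v req="$6" -v preq="$7" \
      'BEGIN { printf "%.4f\n", s * (cpu * pc + mem * pm) + req / 1e6 * preq }'
}

# Instance-based: CPU and memory are billed for every second the instance
# is alive, busy or idle, with no per-request fee.
#   $1 alive instance-seconds  $2 vCPUs        $3 GiB of memory
#   $4 price per vCPU-second   $5 price per GiB-second
instance_based_cost() {
  awk -v s="$1" -v cpu="$2" -v mem="$3" -v pc="$4" -v pm="$5" \
      'BEGIN { printf "%.4f\n", s * (cpu * pc + mem * pm) }'
}
```

Under instance-based billing, an instance that stays alive for a whole 30-day month is billed for all 2,592,000 seconds; under request-based billing only the seconds in which it is handling at least one request count.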

When to use each billing model

The right billing model depends on your traffic pattern and whether the container does work outside of HTTP request handling. Here are concrete examples:

Low-traffic HTTP API (under 1,000 requests/day): Request-based billing. The service is idle most of the time, so you pay almost nothing between requests. With instance-based billing, you would pay for CPU during all that idle time.
Steady production API (thousands of requests/minute): Request-based billing is usually still cheaper, because the gaps between requests on each instance are not billed. Consider instance-based billing when instances are busy most of the time (its per-second rates are lower and it has no per-request fee), or when profiling shows that CPU throttling between requests adds measurable latency.
Scheduled work triggered by Cloud Scheduler: Request-based billing. The scheduler sends an HTTP request that triggers the work. The container processes the request and the CPU is billed for that duration. There is no need for instance-based billing unless the work continues after the HTTP response is returned.
Event-driven processing (Pub/Sub push subscriptions): Request-based billing works for most Pub/Sub push patterns because each message arrives as an HTTP request. The container processes the message within the request lifecycle. Instance-based billing is only needed if processing continues after the HTTP acknowledgement.
Background work outside request handling: Instance-based billing. If your container runs a background thread (pulling from a queue, maintaining a WebSocket connection, or doing periodic maintenance) it needs CPU allocated continuously. Request-based billing would throttle the CPU between requests and stop that background work.

Cloud Run jobs vs services

Cloud Run jobs are billed differently from services. A job runs a container to completion and is always billed for the full execution time. There is no request-based option. Jobs are the right choice for batch processing, data migrations, and scheduled tasks that do not need an HTTP endpoint. Use Cloud Scheduler to trigger jobs on a cron schedule, or Cloud Tasks for queue-driven execution.
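As a sketch of that pattern, the commands below create a job and a Cloud Scheduler trigger that runs it nightly through the Cloud Run Admin API. The job name `my-job`, `IMAGE_URL`, the schedule, and `SCHEDULER_SA_EMAIL` are placeholders; the service account is assumed to already exist and to have permission to run the job.

```bash
# Create a job that runs a container to completion (no HTTP endpoint)
gcloud run jobs create my-job \
  --image=IMAGE_URL \
  --region=us-central1 \
  --task-timeout=30m

# Run it once by hand to check that it works
gcloud run jobs execute my-job --region=us-central1

# Trigger it every night at 03:00 (UTC unless --time-zone is set)
gcloud scheduler jobs create http my-job-nightly \
  --location=us-central1 \
  --schedule="0 3 * * *" \
  --http-method=POST \
  --uri="https://us-central1-run.googleapis.com/apis/run.googleapis.com/v1/namespaces/PROJECT_ID/jobs/my-job:run" \
  --oauth-service-account-email=SCHEDULER_SA_EMAIL
```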

How to reduce Cloud Run costs safely

Follow this sequence. Each step builds on the previous one, and skipping ahead can lead to changes that look good on paper but cause reliability problems in production.

Step 1: Establish baseline metrics

Before changing any configuration, collect data on how your service actually behaves. You need request volume, latency distribution, instance count over time, and CPU/memory utilisation. See the monitoring Cloud Run guide for a full walkthrough of available metrics.

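# Query the Cloud Monitoring API (timeSeries.list) for the last 7 days.
# Requires an authenticated gcloud CLI, curl, and GNU date.
PROJECT_ID="PROJECT_ID"    # replace with your project ID
SERVICE="my-service"
START=$(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%SZ)
END=$(date -u +%Y-%m-%dT%H:%M:%SZ)

# $1 = metric type, $2 = per-series aligner; data is bucketed per hour
query_metric() {
  curl -s -G "https://monitoring.googleapis.com/v3/projects/${PROJECT_ID}/timeSeries" \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    --data-urlencode "filter=metric.type=\"$1\" AND resource.labels.service_name=\"${SERVICE}\"" \
    --data-urlencode "interval.startTime=${START}" \
    --data-urlencode "interval.endTime=${END}" \
    --data-urlencode "aggregation.alignmentPeriod=3600s" \
    --data-urlencode "aggregation.perSeriesAligner=$2"
}

# Request count per hour
query_metric run.googleapis.com/request_count ALIGN_SUM

# Average and p95 request latency
query_metric run.googleapis.com/request_latencies ALIGN_MEAN
query_metric run.googleapis.com/request_latencies ALIGN_PERCENTILE_95

# Instance count over time (hourly peak)
query_metric run.googleapis.com/container/instance_count ALIGN_MAX

# CPU and memory utilisation (hourly p95)
query_metric run.googleapis.com/container/cpu/utilizations ALIGN_PERCENTILE_95
query_metric run.googleapis.com/container/memory/utilizations ALIGN_PERCENTILE_95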

Step 2: Choose the right billing model

Check which billing model each service uses. If a service only handles HTTP requests and does no background work, it should be on request-based billing.

# Check the current billing model
gcloud run services describe my-service \
  --region=us-central1 \
  --format="value(spec.template.metadata.annotations['run.googleapis.com/cpu-throttling'])"
# "true" = request-based billing, "false" = instance-based billing
# Empty output = annotation not set, which means the default (request-based billing)

# Switch to request-based billing
gcloud run services update my-service \
  --region=us-central1 \
  --cpu-throttling

# Switch to instance-based billing (only if background work is needed)
gcloud run services update my-service \
  --region=us-central1 \
  --no-cpu-throttling

The most expensive misconfiguration

A service on instance-based billing that does not actually need it is the single most expensive Cloud Run misconfiguration. You pay for CPU during every idle gap between requests. For a bursty service, this can cost 3-5x more than request-based billing. If you only change one thing after reading this page, check the billing model on every service you run.
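To check every service in a region at once, a sketch like the following lists each service with its CPU throttling and minimum-instance annotations. It assumes `jq` is installed. An `unset` throttling value means the default (request-based billing), `false` flags instance-based billing, and a missing `autoscaling.knative.dev/minScale` annotation means no minimum. Repeat it for each region you deploy to.

```bash
# Print: service name, cpu-throttling annotation, min instances
printf 'SERVICE\tCPU_THROTTLING\tMIN_INSTANCES\n'
gcloud run services list --region=us-central1 --format=json \
  | jq -r '.[] | [
      .metadata.name,
      (.spec.template.metadata.annotations["run.googleapis.com/cpu-throttling"] // "unset"),
      (.spec.template.metadata.annotations["autoscaling.knative.dev/minScale"] // "0")
    ] | @tsv'
```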

Step 3: Review minimum instances

Check whether minimum instances are set above zero. Every minimum instance bills for CPU and memory continuously, even with no traffic.

# Check current min instances
gcloud run services describe my-service \
  --region=us-central1 \
  --format="value(spec.template.metadata.annotations['autoscaling.knative.dev/minScale'])"
# Empty output means no minimum is set, which is the same as 0

# Set to zero (recommended default)
gcloud run services update my-service \
  --region=us-central1 \
  --min-instances=0

The “just in case” trap

Setting min-instances=1 is like leaving your car engine running in the driveway all night so it starts faster in the morning. It works, but you burn fuel 24/7 for a few seconds of convenience. For a low-traffic service handling a few hundred requests per day, the idle billing from a single minimum instance often exceeds the cost of all actual request processing combined.
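To see why, compare billable instance-seconds. The sketch below uses hypothetical helper names and assumes a 30-day month, requests that never overlap, and no free tier. Its third argument is the idle price as a fraction of the active price, because under request-based billing an idle minimum instance may be charged at a lower rate; take the actual ratio from the pricing page rather than the illustrative values used here.

```bash
#!/usr/bin/env bash
# Compare, for a 30-day month, the idle time of one always-on minimum
# instance with the busy time of the requests a low-traffic service serves.
# Assumes requests never overlap; ignores free tiers and rounding.

SECONDS_PER_MONTH=$((30 * 24 * 60 * 60))   # 2,592,000 seconds

# Busy seconds per month.
#   $1 requests per day   $2 average request duration in milliseconds
busy_seconds_per_month() {
  awk -v rpd="$1" -v ms="$2" 'BEGIN { printf "%.0f\n", rpd * 30 * ms / 1000 }'
}

# Idle cost of the warm instance as a multiple of the request-processing cost.
#   $1 requests per day   $2 average duration (ms)
#   $3 idle price as a fraction of the active price (1 = same price)
warm_instance_ratio() {
  local busy
  busy=$(busy_seconds_per_month "$1" "$2")
  awk -v total="$SECONDS_PER_MONTH" -v busy="$busy" -v f="$3" 'BEGIN {
    if (busy == 0) { print "inf"; exit }
    printf "%.0f\n", (total - busy) * f / busy
  }'
}
```

With 300 requests a day at 200 ms each, the instance is busy for about 1,800 seconds a month and idle for the remaining 2,590,200. Even if idle time were billed at a tenth of the active rate, the warm instance's idle charges would come to roughly 144 times the cost of the real work.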

Step 4: Tune concurrency

Higher concurrency means fewer instances for the same traffic, which means lower cost. But only increase concurrency if your application can handle it without degrading latency.

# Increase concurrency for I/O-bound services
gcloud run services update my-service \
  --region=us-central1 \
  --concurrency=200

# Lower concurrency for CPU-bound services
gcloud run services update my-service \
  --region=us-central1 \
  --concurrency=10

After changing concurrency, watch p95 latency and error rate for at least 24 hours. If latency increases, reduce concurrency. For more on how Cloud Run scales in response to concurrency, see the scaling behaviour guide.
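For a first estimate of how concurrency changes instance count, Little's law is enough: the number of requests in flight equals requests per second multiplied by average latency. The helper below is a hypothetical sketch that treats latency as fixed, which only roughly holds for I/O-bound services. Cloud Run's autoscaler also weighs CPU utilisation, so treat the result as a lower bound.

```bash
#!/usr/bin/env bash
# Lower bound on instance count from Little's law:
#   requests in flight = requests per second * average latency (s)
#   instances         >= requests in flight / max concurrency
# Assumes latency does not change with concurrency.
#   $1 requests per second   $2 average latency (ms)   $3 max concurrency
instances_lower_bound() {
  awk -v rps="$1" -v ms="$2" -v c="$3" 'BEGIN {
    in_flight = rps * ms / 1000
    n = int(in_flight / c)
    if (n * c < in_flight) n++   # round up to a whole instance
    print n
  }'
}
```

At 500 requests per second and 200 ms average latency, about 100 requests are in flight: at least 2 instances at concurrency 80, but at least 10 at concurrency 10.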

Concurrency analogy

Think of concurrency like tables in a restaurant. A waiter (instance) handling 1 table at a time needs many waiters for a busy night. If each waiter can handle 10 tables (higher concurrency), you need far fewer waiters. But if they are juggling too many tables, service quality drops. The right number depends on how demanding each table is.

Step 5: Right-size CPU and memory

Check actual CPU and memory usage against allocated amounts. If your container consistently uses 30% of its allocated memory, roughly 70% of the memory you pay for is going unused.

# Reduce memory allocation
gcloud run services update my-service \
  --region=us-central1 \
  --memory=256Mi

# Reduce CPU allocation
gcloud run services update my-service \
  --region=us-central1 \
  --cpu=1
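One way to pick the new memory value is to start from the observed peak (for example, the p95 of container/memory/utilizations multiplied by the current allocation) and add headroom. The helper below is a hypothetical sketch: the headroom percentage and the rounding up to the next power of two, with a 128 MiB floor, are assumptions rather than Cloud Run rules.

```bash
#!/usr/bin/env bash
# Hypothetical sizing rule: observed peak memory plus a headroom
# percentage, rounded up to the next power of two with a 128 MiB floor.
# The rounding policy is an assumption, not a Cloud Run requirement.
#   $1 observed peak memory in MiB   $2 headroom in percent
suggest_memory() {
  awk -v peak="$1" -v pct="$2" 'BEGIN {
    need = peak * (1 + pct / 100)
    size = 128
    while (size < need) size *= 2
    printf "%dMi\n", size
  }'
}
```

A service peaking at 150 MiB with 25% headroom comes out at 256Mi, and the output can be passed directly to the update command, for example `--memory="$(suggest_memory 150 25)"`.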

Step 6: Set maximum instances and cost guardrails

Maximum instances cap how far Cloud Run can scale. Without a limit, a traffic spike can spin up hundreds of instances. Set a max based on your expected peak plus headroom, and pair it with billing budgets and alerts to catch unexpected spend.

# Set maximum instances
gcloud run services update my-service \
  --region=us-central1 \
  --max-instances=20
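For the budget half of the guardrail, one option is a billing budget scoped to the project. `BILLING_ACCOUNT_ID`, `PROJECT_ID`, the amount, and the thresholds below are placeholders. Budgets send alerts; they do not cap or stop spending, which is why the maximum-instances setting is still needed.

```bash
# Alert at 50% and 90% of actual spend, and when forecast spend reaches 100%
gcloud billing budgets create \
  --billing-account=BILLING_ACCOUNT_ID \
  --display-name="cloud-run-monthly" \
  --budget-amount=200USD \
  --filter-projects=projects/PROJECT_ID \
  --threshold-rule=percent=0.5 \
  --threshold-rule=percent=0.9 \
  --threshold-rule=percent=1.0,basis=forecasted-spend
```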

Step 7: Review region, networking, and egress path

Region choice affects both latency and cost. Some regions have higher compute rates. Networking configuration (VPC connectors, direct VPC egress, load balancers) each adds cost. Review whether your current networking path is the simplest one that meets your requirements. See network egress costs for the full pricing breakdown.

Step 8: Validate changes with monitoring and billing data

After each change, wait for at least one full billing cycle (or one full traffic pattern cycle, whichever is longer). Compare the billing data before and after. Use cost breakdown tools to verify that total Cloud Run spend decreased without degrading reliability metrics. If latency or error rates increased, roll back and investigate.

Key cost levers

Request-based vs instance-based billing

This is the single highest-impact setting. A service that only handles HTTP requests will almost always cost less on request-based billing because CPU is not billed during idle time between requests. Switching a service from instance-based to request-based billing can reduce CPU costs by 60-80% for services with significant idle time.

Minimum instances

Minimum instances are billed for every second of the month, even if they handle zero requests: at a reduced idle rate under request-based billing, and at the full rate under instance-based billing. For a service with 1 vCPU and 512 MiB of memory, one minimum instance therefore carries a fixed monthly cost before the first request arrives. Use min-instances=0 as the default. Add minimum instances only for user-facing services where cold start latency (typically 1-10 seconds) would cause a noticeably poor experience.

Concurrency

The default concurrency is 80 requests per instance. If your service is I/O-bound (most time spent waiting on database or network calls), increasing concurrency to 200-500 means fewer instances at peak traffic and proportionally lower cost. For CPU-bound services, high concurrency causes contention. Measure p95 latency after every concurrency change.

CPU sizing

CPU is billed per vCPU-second and is typically the largest line item. Allocations go up to 8 vCPUs, and fractional values below 1 vCPU are also available with some restrictions. Start with 1 vCPU and increase only if your application is CPU-bound and latency metrics show it needs more. Doubling the CPU allocation doubles the CPU cost per second.

Memory sizing

Memory is billed per GiB-second. Start with the minimum your application needs (128 MiB or 256 MiB for most lightweight services) and increase if you see OOM kills. An OOM kill forces a cold restart, which can cost more overall than a slightly higher memory allocation. Monitor the container/memory/utilizations metric to find the right level.

The OOM paradox

Cutting memory too aggressively can actually increase your bill. When a container hits its memory limit, it gets killed and restarted from scratch. That cold restart burns extra CPU and adds latency for users. A little headroom is cheaper than frequent restarts.

Maximum instances

Without a maximum, Cloud Run scales as far as it needs to handle incoming traffic. This is a feature, but it means a traffic spike or retry storm can spin up many instances quickly. Set a maximum that accommodates your expected peak, and use billing alerts as a safety net.

Region choice

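Cloud Run pricing varies by region. Regions are grouped into pricing tiers, and Tier 2 regions charge more per vCPU-second and GiB-second than Tier 1 regions. Many North American and European regions are Tier 1, but the split is not strictly geographic, so check the tier of each candidate region. If your users are concentrated in one geography, deploy in the nearest lower-cost region that meets your latency needs. Check the GCP Pricing Calculator for current per-region rates.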

Networking and internet egress

Data leaving Google’s network to the internet is billed at standard GCP egress rates. For services that return large responses (images, file downloads, API payloads), egress can become a significant cost. Use Cloud CDN for cacheable responses. Compress payloads. Keep traffic within Google’s network where possible (same-region communication between GCP services is free).
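To judge whether egress matters for a given service, estimate the volume first. The helper below is a hypothetical sketch that converts traffic and average response size into GiB per 30-day month; multiply the result by the current internet egress rate for your destinations.

```bash
#!/usr/bin/env bash
# Monthly internet egress volume from traffic and average response size.
# Binary units (KiB in, GiB out) and a 30-day month.
#   $1 responses per day   $2 average response size in KiB
egress_gib_per_month() {
  awk -v rpd="$1" -v kib="$2" \
      'BEGIN { printf "%.1f\n", rpd * 30 * kib / (1024 * 1024) }'
}
```

A million 200 KiB responses per day is roughly 5,722 GiB a month; if compression brings the average down to 50 KiB, the same traffic is about 1,430 GiB.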

Hidden egress surprises

Egress costs do not show up on the Cloud Run line item in your bill. They appear under Networking. This means you can look at your Cloud Run costs, think everything is fine, and miss thousands of dollars in egress charges hiding elsewhere. Always check your full billing breakdown by SKU.
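If you export billing data to BigQuery, a query along these lines lists the most expensive SKUs across all services for one project over the last 30 days, which puts compute and networking SKUs side by side. The table name follows the standard usage cost export naming but is a placeholder, and `cost` here is the gross amount before credits.

```bash
# Top SKUs by gross cost (before credits) for one project, last 30 days
bq query --use_legacy_sql=false '
SELECT
  service.description AS service_name,
  sku.description AS sku_name,
  ROUND(SUM(cost), 2) AS total_cost
FROM `BILLING_PROJECT.BILLING_DATASET.gcp_billing_export_v1_XXXXXX_XXXXXX_XXXXXX`
WHERE project.id = "PROJECT_ID"
  AND usage_start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY service_name, sku_name
ORDER BY total_cost DESC
LIMIT 25'
```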

Direct VPC egress and VPC connectors

If your Cloud Run service connects to resources in a VPC (Cloud SQL, Memorystore, internal services), you need either a Serverless VPC Access connector or direct VPC egress. VPC connectors run on dedicated instances and have their own compute cost. Direct VPC egress (where available) avoids the connector overhead and is usually cheaper. Review which approach you are using and whether you can switch.
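If the service currently uses a connector, switching it to direct VPC egress is a service update. `NETWORK` and `SUBNET` are placeholders for an existing VPC network and a subnet in the same region, and `private-ranges-only` sends only traffic bound for private IP ranges through the VPC. Treat this as a sketch and try it on a non-production service first: the subnet needs enough free IP addresses for your instances, and the flag combination should be checked against `gcloud run services update --help` for your gcloud version.

```bash
# Move a service from a Serverless VPC Access connector to direct VPC egress
gcloud run services update my-service \
  --region=us-central1 \
  --clear-vpc-connector \
  --network=NETWORK \
  --subnet=SUBNET \
  --vpc-egress=private-ranges-only
```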

Committed use discounts

Cloud Run supports spend-based committed use discounts (CUDs). If you commit to a minimum level of Cloud Run spend for one or three years, you receive a discount on that committed amount. This makes sense only for workloads with predictable, sustained Cloud Run spend. See GCP pricing models for how CUDs work across GCP services.

Budgets, alerts, and cost visibility

Set up billing budgets and alerts on your Cloud Run services. Use the GCP Cost Table and Billing Reports to break down spend by service, SKU, and label. The Active Assist Recommender can surface idle Cloud Run services and suggest configuration changes. Use resource-level cost breakdowns to find which specific services cost the most.

Common mistakes

Leaving min instances enabled by default. Many teams set min-instances=1 on every service to avoid cold starts and never revisit it. For a low-traffic service handling a few hundred requests per day, the idle billing from a single minimum instance can exceed the cost of all actual request processing. Audit minimum instance settings across all services regularly.
Using instance-based billing for a request-only service. If your container does no work between HTTP requests, instance-based billing means you are paying for CPU during every idle gap. This is one of the most common and expensive misconfigurations. Check the billing model on every service during cost reviews.
Over-allocating CPU or memory. Starting with 2 vCPUs and 1 GiB of memory “just in case” doubles the CPU cost and quadruples the memory cost compared to 1 vCPU and 256 MiB. Start low, monitor utilisation, and scale up only when metrics show you need it.
Ignoring max instances. Without a maximum, a sudden traffic spike or retry loop can scale Cloud Run to hundreds of instances in minutes. Set a maximum based on your expected peak plus reasonable headroom.
Forgetting egress and networking costs. Cloud Run compute billing is only part of the total. Internet egress, load balancer fees, VPC connector costs, and logging/monitoring ingestion all add up. Review the full billing breakdown, not just the Cloud Run line item.
Tuning concurrency without measuring latency. Increasing concurrency reduces instance count and cost, but if the application cannot handle the load, p95 latency spikes and error rates increase. Always compare latency before and after concurrency changes.
Treating Cloud Run jobs and services as interchangeable. Jobs and services have different billing models, scaling behaviour, and use cases. A job runs to completion and is always billed for full execution time. A service responds to requests and can scale to zero. Using a service where a job fits better (or vice versa) leads to unnecessary cost and complexity.

Request-based vs instance-based billing

This is the central cost decision for every Cloud Run service. Use this matrix to choose:

Scenario	Best billing model	Why
Bursty traffic (occasional spikes, long idle gaps)	Request-based	Zero cost during idle periods
Steady, high-volume traffic	Request-based (usually)	Still cheaper unless instances are busy most of the time; instance-based rates are lower per second and have no per-request fee
Background execution (queue workers, background threads)	Instance-based	CPU must stay active between requests
User-facing with strict latency requirements	Request-based + min instances	Avoids most cold starts without moving the whole service to always-on CPU billing
Cost control is the top priority	Request-based + min-instances=0	True pay-per-request, zero baseline cost
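The “usually” in the steady-traffic row comes down to one ratio. Instance-based billing charges every second an instance is alive, typically at a lower per-second rate (check the current pricing page); request-based billing charges only busy seconds. The sketch below is a hypothetical helper, with rates you supply, that returns the fraction of time an instance must be busy before instance-based billing becomes cheaper.

```bash
#!/usr/bin/env bash
# Break-even busy fraction between the two billing models, for the same
# CPU and memory size:
#   request-based cost  = busy_seconds  * request_rate   (+ request fees)
#   instance-based cost = alive_seconds * instance_rate
# Instance-based is cheaper once busy_seconds / alive_seconds is above
# instance_rate / request_rate. Request fees are ignored here; including
# them would lower the break-even point further.
#   $1 instance-based price per instance-second
#   $2 request-based price per instance-second
break_even_busy_fraction() {
  awk -v pi="$1" -v pr="$2" 'BEGIN { printf "%.2f\n", pi / pr }'
}
```

With illustrative rates of 18 and 24 (same units, same CPU and memory size), the break-even is 0.75: instance-based billing only wins if instances are busy more than three quarters of the time.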

Common misconception about Pub/Sub and Cloud Scheduler

A common misconception is that Pub/Sub-triggered or Cloud Scheduler-triggered services always need instance-based billing. They do not. If the work happens entirely within the HTTP request lifecycle (message arrives as a push request, container processes it, returns a response), request-based billing works and is cheaper. Instance-based billing is only required when the container does work after returning the HTTP response or outside of request handling entirely.

If you are unsure which model is cheaper for your workload, use the cost estimation guide to model both scenarios with your actual traffic numbers.

Frequently asked questions

What is the biggest cost driver in Cloud Run?

CPU time dominates most Cloud Run bills. The vCPU-second rate is the highest unit cost and it multiplies against both request duration and CPU allocation. For a typical HTTP service, CPU accounts for 80-90% of the total. Once the billing model and minimum instances are set correctly, reducing average request latency through application-level optimisation (faster database queries, caching, smaller payloads) often has a larger impact on cost than further infrastructure tuning.

Should I set min instances to 1?

Only if your service is user-facing and cold start latency would cause a noticeably bad experience. A single minimum instance bills for CPU and memory 24/7, whether or not it receives traffic. For a low-traffic internal API or async processor, idle billing from min-instances=1 can exceed the cost of all actual request processing. Start with min-instances=0 and add minimum instances only after measuring cold start impact on real users.

Does higher concurrency always save money?

Not always. Higher concurrency reduces instance count, which reduces cost, but only if the application handles concurrent requests without degrading. I/O-bound services (waiting on databases or external APIs) benefit from concurrency of 200 or higher. CPU-bound services suffer from contention at high concurrency, which increases latency and can trigger more scaling, not less. Always measure p95 latency after changing concurrency.

Is Cloud Run still cost-effective for steady traffic?

It depends on the volume. For moderate steady traffic (under roughly 50 vCPU of sustained demand), Cloud Run with request-based billing is competitive with Compute Engine VMs, especially when you factor in the operational overhead of managing VMs. For very high sustained throughput, committed use discounts on VMs or GKE can be cheaper per vCPU-hour. Run the numbers with the GCP Pricing Calculator for your specific workload before deciding.

Do networking, egress, and logging costs count toward total Cloud Run spend?

Yes, and they are easy to overlook. Cloud Run compute billing (CPU, memory, requests) is only part of the total cost. Internet egress is billed separately at standard GCP networking rates. Cloud Logging ingestion and Cloud Monitoring custom metrics add cost at scale. Load Balancer usage, if enabled, adds an hourly charge plus per-GB processing fees. Review your full billing breakdown, not just the Cloud Run line item.

Last verified: 27 March 2026. Cloud services change frequently. Verify details against official documentation before making infrastructure decisions.
