Batch Jobs in GCP: Cloud Run Jobs vs Cloud Batch Explained
A batch job runs to completion rather than serving ongoing requests. Processing a day’s worth of data, generating a report, transforming files, evaluating a model: these are batch workloads. GCP gives you two services for this pattern. Cloud Run Jobs is the simpler option: one command to create, one to run, container-based, done. Cloud Batch handles the harder cases: GPU clusters, MPI parallelism across VMs, HPC workloads, and tasks that run for days. Both provision compute, run tasks in parallel, and release everything when the job finishes.
Simple explanation
Most cloud services wait for requests and respond to them: a web server, an API, a Cloud Function. A batch job is different. You hand it a dataset or a list of tasks, tell it how many workers to spin up, and it grinds through the work and exits when finished.
A batch job is like a printing press run. You load the paper, press start, and it prints 10,000 copies without you asking for each one. When it is finished, it stops. No requests, no waiting, just a defined workload with a defined end.
The difference from a scheduled service is intent: a batch job is designed to complete. It has a defined start, a defined workload, and a defined end. You are not keeping a server warm for the next request. You are running a finite task and shutting everything down afterwards.
If your work is event-driven rather than time-triggered — reacting to a new file upload rather than running nightly — consider Cloud Tasks or event-driven architectures with Pub/Sub instead.
How batch jobs work in GCP
The lifecycle follows the same pattern whether you use Cloud Run Jobs or Cloud Batch:
- Define the workload. Package your processing logic in a container image and push it to Artifact Registry. Your code should read a task index from an environment variable so each worker knows which slice of the data to process.
- Set the task count and parallelism. Task count is the total number of work units. Parallelism is how many run at the same time. 100 tasks at parallelism 10 means 10 workers at a time, cycling through until all 100 are done.
- Submit the job. GCP provisions compute, starts the tasks, and passes each one its index via an environment variable.
- Tasks process their assigned slice. Each task reads its index, selects the records or files it owns, and processes only those.
- The job finishes and resources are released. Once all tasks complete successfully, GCP tears down the compute. You pay only for what ran.
- Logs and retries matter. Both services write logs to Cloud Logging. Retries handle transient failures and Spot VM preemptions. Long tasks should write checkpoints to Cloud Storage so they can resume rather than restart from scratch.
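The relationship between task count and parallelism in the steps above can be simulated locally. This is a minimal sketch, not how GCP schedules tasks internally: a thread pool stands in for the managed workers, and `run_task` and `run_job` are hypothetical names used only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

TASK_COUNT = 100   # total work units
PARALLELISM = 10   # how many run at the same time


def run_task(task_index: int) -> int:
    """Stand-in for one task; a real task would process its slice here."""
    return task_index


def run_job(task_count: int, parallelism: int) -> list[int]:
    """Run every task index once, with at most `parallelism` in flight."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # map() hands out indices 0..task_count-1; a worker that finishes
        # picks up the next pending index until none remain.
        return list(pool.map(run_task, range(task_count)))


if __name__ == "__main__":
    finished = run_job(TASK_COUNT, PARALLELISM)
    print(f"{len(finished)} tasks finished")
```

Each index is handed out exactly once, which is the property the real services also give you and the reason the task index is safe to partition on.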
For recurring jobs, use Cloud Scheduler to trigger execution on a cron schedule: nightly at 2am, every Monday morning, or whatever cadence your workload needs.
Choosing between Cloud Run Jobs and Cloud Batch
Start with Cloud Run Jobs. It covers the vast majority of batch workloads with far less configuration. Only switch to Cloud Batch when you hit a specific requirement it cannot meet: GPUs, MPI, multi-day runtimes, or deep VM configuration.
| Feature | Cloud Run Jobs | Cloud Batch |
|---|---|---|
| Setup complexity | Low: one gcloud command | Medium: JSON job spec required |
| Runtime model | Container-only (serverless-feeling) | Container or script on managed VMs |
| Container support | Yes, required | Yes, optional (scripts also work) |
| VM control | Limited | Full: machine type, local SSD, GPU |
| GPU support | Limited | Yes, full GPU instance support |
| MPI / HPC support | No | Yes: multiple tasks per VM, shared memory |
| Max task duration | Up to 24 hours | Unlimited (tasks can run for days) |
| Parallelism | Yes, configurable task count and concurrency | Yes, configurable at task group level |
| Spot VM support | Yes (gen2 execution environment) | Yes via SPOT provisioningModel |
| Scheduling | Via Cloud Scheduler | Via Cloud Scheduler |
| Best use cases | ETL, exports, reports, data pipelines | ML training, HPC, genomics, rendering, long transforms |
| Best for beginners | Yes, straightforward to get started | Steeper learning curve due to job spec format |
| Operational overhead | Very low | Higher: more configuration to manage |
Choose Cloud Run Jobs when:
- Your work fits in a container and runs in under 24 hours
- You want the simplest possible setup
- You are running ETL, report generation, file transforms, or data exports
- You do not need GPU acceleration or MPI
Choose Cloud Batch when:
- You need GPU instances for ML training or rendering
- You need MPI parallelism across multiple VMs
- Tasks need to run longer than 24 hours
- You need fine-grained control over machine type, local SSD, or network configuration
- You are running HPC, genomics, or simulation workloads
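The two checklists above reduce to a short rule: default to Cloud Run Jobs and switch only when a specific requirement forces it. The sketch below encodes that rule. The `Workload` fields and the `choose_service` function are illustrative assumptions, not part of any GCP API, and the 24-hour threshold is the one stated in the comparison table.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    needs_gpu: bool = False
    needs_mpi: bool = False
    needs_vm_control: bool = False   # machine type, local SSD, network
    max_task_hours: float = 1.0


def choose_service(w: Workload) -> str:
    """Return 'Cloud Batch' only when a listed requirement forces it."""
    if w.needs_gpu or w.needs_mpi or w.needs_vm_control:
        return "Cloud Batch"
    if w.max_task_hours > 24:
        return "Cloud Batch"
    return "Cloud Run Jobs"


if __name__ == "__main__":
    print(choose_service(Workload()))                   # nightly ETL
    print(choose_service(Workload(needs_gpu=True)))     # GPU training
    print(choose_service(Workload(max_task_hours=72)))  # multi-day run
```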
The real trade-off between the two
Cloud Run Jobs is a managed taxi service. You say where you want to go, and it handles the car, the route, and the driving. Cloud Batch is a rental car. You get full control of the vehicle: choose the model, the fuel type, the GPS settings. That control is worth it when you have specific requirements. When you just need to get somewhere, the taxi is faster and simpler.
Cloud Run Jobs feels like a serverless tool. You define a container, say how many tasks to run, and submit. GCP handles the rest. You never configure a VM, write a job spec, or think about allocation policies. For most teams building data pipelines or automation tasks, this is exactly the right level of abstraction.
Cloud Batch is infrastructure-oriented. You write a JSON job specification that declares machine types, provisioning models, task groups, compute resources, and logging policies. That verbosity exists for good reason: it gives you the control required for GPU workloads, HPC clusters, and jobs that run for days on specialised hardware.
For beginners: start with Cloud Run Jobs. Most batch work fits comfortably within its limits. You will spend less time on configuration and more time on the actual work. See choosing between Cloud Run, GKE, and VMs for the broader compute decision picture.
When to use batch jobs in GCP
Batch processing covers a wide range of real workloads. Here are the most common patterns and which service fits each.
Good fits for Cloud Run Jobs:
- Nightly ETL jobs. Extract records from a source, transform them, load into BigQuery or Cloud Storage.
- Report generation. Pull data, compute summaries, write output files.
- Image or document processing. Resize images, convert formats, or run OCR on documents in bulk.
- Scheduled data exports. Export query results, send files to external systems on a cron schedule.
- Data cleanup jobs. Deduplicate, validate, or archive old records.
- Model evaluation runs. Score a dataset against a pre-trained model.
Good fits for Cloud Batch:
- ML model training. GPU-accelerated training jobs that run for hours or days.
- Genomics pipelines. Large-scale parallel DNA or RNA analysis.
- 3D rendering. Frame-by-frame rendering with many parallel workers.
- HPC simulations. Tightly coupled workloads needing MPI across multiple VMs.
- Large-scale data transforms. Jobs that need local SSD throughput or specialised hardware configurations.
For streaming workloads where data arrives continuously rather than in discrete batches, consider Dataflow instead. Dataflow handles both batch and streaming pipelines and is better suited to continuous data processing at scale.
Cloud Run Jobs
Cloud Run Jobs extends Cloud Run for run-to-completion workloads. You package your logic in a container image, push it to Artifact Registry, and create a job with a single gcloud command. Tasks run in parallel containers, each receiving its index via environment variables, and the job completes when all tasks succeed.
This is the recommended starting point for most batch work. No JSON spec, no VM configuration, minimal operational surface.
Create and run a Cloud Run Job:
# Create a Cloud Run Job
gcloud run jobs create my-job \
  --image=us-central1-docker.pkg.dev/PROJECT_ID/my-repo/processor:v1 \
  --region=us-central1 \
  --tasks=50 \
  --parallelism=5 \
  --max-retries=3 \
  --task-timeout=3600s
Key flags:
- --tasks: total number of task instances to run; each gets a unique index
- --parallelism: how many tasks run simultaneously (5 here; the rest queue)
- --max-retries: how many times to retry a failed or preempted task
- --task-timeout: maximum duration per task before it is killed and retried
# Execute the job immediately
gcloud run jobs execute my-job --region=us-central1
# Execute and wait for completion before returning
gcloud run jobs execute my-job \
  --region=us-central1 \
  --wait
# View execution history
gcloud run jobs executions list \
  --job=my-job \
  --region=us-central1
Inside your container, CLOUD_RUN_TASK_INDEX tells each task its position (0-based) and CLOUD_RUN_TASK_COUNT tells it the total number of tasks. Use both to partition the work so no two tasks process the same data.
Cloud Batch
Cloud Batch exists for workloads that need more than Cloud Run Jobs can offer: GPU instances, MPI-coupled tasks, jobs that run for days, or precise control over machine type and storage. It manages the underlying VMs for you but you configure everything through a JSON job specification.
If you are coming from HPC, scientific computing, or ML training pipelines, Cloud Batch will feel familiar. If this is your first batch workload in GCP, start with Cloud Run Jobs and come back to Cloud Batch only when you hit a requirement it cannot address.
# Enable the Batch API first
gcloud services enable batch.googleapis.com
Define the job in a JSON file:
{
  "taskGroups": [
    {
      "taskSpec": {
        "runnables": [
          {
            "container": {
              "imageUri": "us-central1-docker.pkg.dev/PROJECT_ID/my-repo/processor:v1",
              "entrypoint": "/bin/sh",
              "commands": ["-c", "python process.py --task-index ${BATCH_TASK_INDEX}"]
            }
          }
        ],
        "computeResource": {
          "cpuMilli": 2000,
          "memoryMib": 4096
        },
        "maxRetryCount": 3,
        "maxRunDuration": "3600s"
      },
      "taskCount": 100,
      "parallelism": 10
    }
  ],
  "allocationPolicy": {
    "instances": [
      {
        "policy": {
          "machineType": "e2-standard-4",
          "provisioningModel": "SPOT"
        }
      }
    ]
  },
  "logsPolicy": {
    "destination": "CLOUD_LOGGING"
  }
}
Key fields in the job spec:
- taskGroups: one or more groups of tasks; each group shares the same task spec
- runnables: the container (or script) that each task executes
- computeResource: CPU and memory allocated to each task
- maxRetryCount: retries per task on failure or preemption
- maxRunDuration: task timeout; tasks running longer than this are killed and retried
- taskCount / parallelism: total tasks and how many run at the same time
- allocationPolicy: the VM type and provisioning model (SPOT reduces cost significantly)
- logsPolicy: where task logs go (CLOUD_LOGGING sends them to Cloud Logging)
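If you submit many similar jobs, generating the spec from code avoids hand-editing JSON. The sketch below builds a dictionary with the same fields described above and writes it out as job.json. `build_job_spec` is a hypothetical helper, the CPU, memory, retry, and timeout values are copied from the example spec, and the container is assumed to read BATCH_TASK_INDEX from its environment through its default entrypoint.

```python
import json


def build_job_spec(image_uri: str, task_count: int, parallelism: int,
                   machine_type: str = "e2-standard-4",
                   spot: bool = True) -> dict:
    """Assemble a Cloud Batch job spec mirroring the example fields."""
    if task_count < 1 or parallelism < 1:
        raise ValueError("task_count and parallelism must be at least 1")
    policy = {"machineType": machine_type}
    if spot:
        policy["provisioningModel"] = "SPOT"
    return {
        "taskGroups": [{
            "taskSpec": {
                "runnables": [{"container": {"imageUri": image_uri}}],
                "computeResource": {"cpuMilli": 2000, "memoryMib": 4096},
                "maxRetryCount": 3,
                "maxRunDuration": "3600s",
            },
            "taskCount": task_count,
            "parallelism": parallelism,
        }],
        "allocationPolicy": {"instances": [{"policy": policy}]},
        "logsPolicy": {"destination": "CLOUD_LOGGING"},
    }


if __name__ == "__main__":
    spec = build_job_spec(
        "us-central1-docker.pkg.dev/PROJECT_ID/my-repo/processor:v1",
        task_count=100,
        parallelism=10,
    )
    with open("job.json", "w") as fh:
        json.dump(spec, fh, indent=2)
```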
# Submit the job
gcloud batch jobs submit my-batch-job \
  --location=us-central1 \
  --config=job.json
# Check job status
gcloud batch jobs describe my-batch-job --location=us-central1
# List all jobs in a region
gcloud batch jobs list --location=us-central1
How task partitioning works
Parallelism only helps if each task processes a different slice of the data. Without partitioning, every task processes the full dataset, producing duplicate results and wasting the compute spent on every worker beyond the first.
Running 50 parallel tasks that each fetch and process all 50,000 records produces 50 duplicate outputs. This is the single most common mistake in batch job design. Always use the task index to partition work before writing any other logic.
The solution is to use the task index. GCP injects the index into each task as an environment variable:
- Cloud Run Jobs: CLOUD_RUN_TASK_INDEX and CLOUD_RUN_TASK_COUNT
- Cloud Batch: BATCH_TASK_INDEX and BATCH_TASK_COUNT
Your code reads these values and uses them to select only the items it is responsible for. A simple modulo pattern handles this cleanly even when the total item count does not divide evenly:
import os
# Default to a single task so the script also runs outside GCP
task_index = int(os.environ.get("CLOUD_RUN_TASK_INDEX", 0))
task_count = int(os.environ.get("CLOUD_RUN_TASK_COUNT", 1))
# Fetch the full list once, then select only this task's slice.
# fetch_all_items() and process() are placeholders for your own code;
# the list must come back in the same order in every task.
all_items = fetch_all_items()
my_items = [item for i, item in enumerate(all_items) if i % task_count == task_index]
for item in my_items:
    process(item)
Processing 1,000 files with 10 tasks is like dealing a deck of 1,000 cards evenly across 10 players. Each player gets exactly 100 cards (their assigned slice) and plays only those. The task index is which player you are. No card is played twice, and every card is covered.
For Cloud Batch, replace CLOUD_RUN_TASK_INDEX with BATCH_TASK_INDEX and CLOUD_RUN_TASK_COUNT with BATCH_TASK_COUNT. The pattern is identical.
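The modulo pattern deals items out round-robin. When each task should instead own one contiguous range (for example, a block of rows it can fetch with a single offset and limit), the bounds can be computed from the same two values. This is a minimal sketch; `slice_bounds` is a hypothetical helper, and it spreads any remainder across the first few tasks so range sizes differ by at most one.

```python
def slice_bounds(n_items: int, task_index: int, task_count: int) -> tuple[int, int]:
    """Half-open [start, end) range of item positions owned by one task."""
    base, extra = divmod(n_items, task_count)
    # The first `extra` tasks each take one additional item.
    start = task_index * base + min(task_index, extra)
    end = start + base + (1 if task_index < extra else 0)
    return start, end


if __name__ == "__main__":
    for idx in range(3):
        print(idx, slice_bounds(10, idx, 3))
```

With 1,000 items and 10 tasks this gives task 0 positions 0 to 99 and task 1 positions 100 to 199, and the ranges still cover every item exactly once when the count does not divide evenly.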
Using Spot VMs to reduce cost
Spot VMs reduce batch job compute costs by 60–91% compared to on-demand pricing. Both Cloud Run Jobs and Cloud Batch support Spot capacity. The trade-off is that Spot VMs can be preempted at any time, so your job design must account for interruptions.
Two things are required for safe Spot-based batch jobs:
- Retries so preempted tasks restart automatically rather than failing permanently.
- Checkpoints for long tasks: write progress to Cloud Storage so a restarted task resumes from the last saved point rather than from scratch.
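The checkpoint requirement can be sketched as a load-then-resume loop. To keep the example runnable anywhere, a local file stands in for an object in Cloud Storage; in a real job you would read and write the checkpoint through your storage client instead. The function names, the `next_item` field, and the `stop_after` parameter (which simulates a preemption) are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def load_checkpoint(path: Path) -> int:
    """Return the position to resume from, or 0 if no checkpoint exists."""
    if path.exists():
        return json.loads(path.read_text())["next_item"]
    return 0


def save_checkpoint(path: Path, next_item: int) -> None:
    # Write to a temporary file and rename it, so an interruption
    # mid-write cannot leave a half-written checkpoint behind.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"next_item": next_item}))
    tmp.replace(path)


def run(items: list, path: Path, every: int = 100, stop_after=None) -> list:
    """Process items from the last checkpoint, saving every `every` items."""
    start = load_checkpoint(path)
    processed = []
    for i in range(start, len(items)):
        if stop_after is not None and len(processed) == stop_after:
            return processed  # simulated preemption: no final save
        processed.append(items[i])  # stand-in for real processing
        if (i + 1) % every == 0:
            save_checkpoint(path, i + 1)
    save_checkpoint(path, len(items))
    return processed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        ckpt = Path(d) / "ckpt.json"
        items = list(range(250))
        run(items, ckpt, stop_after=230)   # interrupted after 230 items
        resumed = run(items, ckpt)         # resumes at the last checkpoint
        print(f"retry processed {len(resumed)} items")
```

Items handled after the last saved checkpoint are processed again on retry, so the per-item work should be idempotent or the checkpoint interval short enough that the repeat is cheap.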
Cloud Run Job on Spot VMs (requires the gen2 execution environment):
gcloud run jobs create my-spot-job \
  --image=IMAGE \
  --region=us-central1 \
  --tasks=100 \
  --parallelism=10 \
  --max-retries=3 \
  --execution-environment=gen2
Cloud Batch Spot configuration (in the allocationPolicy section of job.json):
{
  "allocationPolicy": {
    "instances": [
      {
        "policy": {
          "machineType": "e2-standard-4",
          "provisioningModel": "SPOT"
        }
      }
    ]
  }
}
Cost considerations
Both services can use Spot capacity for significant savings, but the cheapest option is not always the simplest to operate.
- Spot savings require retries. Without retry configuration, a single preemption fails the task permanently. Always set retries when using Spot.
- Long tasks need checkpointing. A 6-hour Spot task preempted at hour 5 with no checkpointing restarts from zero. Write intermediate results to Cloud Storage at regular intervals.
- Cloud Run Jobs reduces operational overhead for smaller teams. Less configuration means fewer things to maintain, debug, and audit. For most workloads, this matters more than squeezing the last percentage point of cost savings.
- Cloud Batch is better value for specialised compute. If you genuinely need GPU instances or large-memory VMs, Cloud Batch’s direct VM control and Spot support at that tier can produce better economics.
- Parallelism affects time-to-completion, not total cost. Running 100 tasks at parallelism 100 costs the same total compute as parallelism 10. Higher parallelism finishes faster but does not reduce cost.
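The last point is easy to check with arithmetic. The sketch below assumes every task takes the same time and that tasks run in full waves, which real jobs only approximate. The price per task-hour is a made-up figure for illustration, not a GCP rate, and `job_estimate` is a hypothetical helper.

```python
import math


def job_estimate(task_count: int, parallelism: int,
                 minutes_per_task: float,
                 price_per_task_hour: float) -> tuple[float, float]:
    """Return (wall-clock minutes, total compute cost) for a job."""
    waves = math.ceil(task_count / parallelism)
    wall_clock_minutes = waves * minutes_per_task
    # Total compute depends only on how much work runs, not how wide.
    task_hours = task_count * minutes_per_task / 60
    return wall_clock_minutes, task_hours * price_per_task_hour


if __name__ == "__main__":
    for parallelism in (10, 100):
        minutes, cost = job_estimate(100, parallelism, 6, 0.10)
        print(f"parallelism={parallelism}: {minutes} min, cost {cost:.2f}")
```

Under these assumptions, 100 six-minute tasks take 60 minutes at parallelism 10 and 6 minutes at parallelism 100, while the compute bill is the same in both cases.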
See cost optimisation strategies in GCP for the broader picture on managing cloud spend.
How to trigger and monitor jobs
Manual execution:
# Trigger a Cloud Run Job immediately
gcloud run jobs execute my-job --region=us-central1 --wait
# Trigger a Cloud Batch job
gcloud batch jobs submit my-batch-job \
  --location=us-central1 \
  --config=job.json
Scheduled execution uses Cloud Scheduler to run jobs on a cron schedule. Cloud Scheduler can call the Cloud Run Jobs execute API directly, or publish a message to Pub/Sub that triggers a downstream process to submit a Cloud Batch job.
Viewing logs:
# Logs for a Cloud Run Job execution
gcloud logging read \
  'resource.type="cloud_run_job" AND resource.labels.job_name="my-job"' \
  --limit=100 \
  --format="value(textPayload)"
# Logs for a Cloud Batch job
gcloud logging read \
  'resource.type="batch.googleapis.com/Job" AND labels."batch.googleapis.com/job_name"="my-batch-job"' \
  --limit=100 \
  --format="value(textPayload)"
Both services write structured logs to Cloud Logging. Use the Logs Explorer to filter by job name, task index, or severity. For persistent visibility, set up log-based metrics and alerts in Cloud Monitoring to notify you when a job fails or takes longer than expected.
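Filtering by severity or task index works best when the task emits that information itself. One common approach, sketched here under the assumption that your job runs on Cloud Run (which parses single-line JSON written to stdout into structured log entries), is to print one JSON object per log line. The `log` helper and the `task_index` field name are illustrative choices, and entries written this way appear under jsonPayload rather than textPayload.

```python
import json
import os
import sys


def log(severity: str, message: str, **fields) -> str:
    """Emit one JSON log line on stdout and return it."""
    entry = {
        "severity": severity,   # e.g. INFO, WARNING, ERROR
        "message": message,
        # Falls back to 0 when run outside Cloud Run Jobs.
        "task_index": int(os.environ.get("CLOUD_RUN_TASK_INDEX", 0)),
        **fields,
    }
    line = json.dumps(entry)
    print(line, file=sys.stdout, flush=True)
    return line


if __name__ == "__main__":
    log("INFO", "slice started", items=100)
    log("ERROR", "could not parse file", file="a.csv")
```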
Checking job status:
# Cloud Run Job execution status
gcloud run jobs executions describe EXECUTION_NAME --region=us-central1
# Cloud Batch job status
gcloud batch jobs describe my-batch-job --location=us-central1
Common mistakes
- Every task processes the full dataset. 50 parallel tasks that each process all 50,000 records produce 50 duplicate outputs and waste 49x the compute. Read CLOUD_RUN_TASK_INDEX or BATCH_TASK_INDEX and use it to select only that task’s assigned slice.
- No retries for Spot-based jobs. Spot VMs can be preempted at any point. With retries at 0, a preempted task fails permanently. Set --max-retries (Cloud Run Jobs) or maxRetryCount (Cloud Batch) to at least 2 or 3.
- No checkpointing for long-running Spot tasks. A task preempted at hour 5 of a 6-hour run with no checkpointing restarts from the beginning. Write progress to Cloud Storage at regular intervals and check for an existing checkpoint at startup.
- No task timeout set. A task stuck in an infinite loop or waiting on a hung external call keeps consuming compute until a timeout or a manual cancel stops it. Set maxRunDuration (Cloud Batch) or --task-timeout (Cloud Run Jobs) explicitly, to a value that fits the workload, so stuck tasks are killed and retried automatically.
- Choosing Cloud Batch when Cloud Run Jobs would be simpler. Cloud Batch requires a JSON job spec with allocation policies, task groups, and compute resource definitions. For a standard ETL job or report generator, this adds complexity with no benefit. Use Cloud Run Jobs unless you genuinely need what Cloud Batch offers.
- Choosing Cloud Run Jobs when GPU or MPI support is needed. If your workload requires GPU-accelerated processing or tight MPI coupling across VMs, Cloud Run Jobs cannot provide it. Recognise this requirement early and reach for Cloud Batch before building around the wrong tool.
Summary
- Batch jobs run to completion: processing a dataset, generating a report, transforming files, then they stop
- Cloud Run Jobs is simpler and the right default for most containerised batch workloads
- Cloud Batch handles GPUs, MPI, multi-day tasks, and HPC; use it when Cloud Run Jobs cannot meet your requirements
- Use the task index environment variable to partition work across parallel tasks, never process the full dataset in every task
- Spot VMs reduce cost by 60–91%; always pair Spot with retries and checkpointing for long tasks
- Set a task timeout so hung tasks are killed and retried rather than running indefinitely
- Use Cloud Scheduler to trigger jobs on a schedule and Cloud Logging to monitor results
Frequently asked questions
What is a batch job in GCP?
A batch job is a unit of work that runs to completion rather than continuously serving requests. Examples include processing overnight data, generating reports, transforming files, or running a model evaluation. In GCP, the two main services for running batch jobs are Cloud Run Jobs (simpler, container-first) and Cloud Batch (VM-level control, GPU support, HPC use cases).
When should I use Cloud Batch instead of Cloud Run Jobs?
Use Cloud Batch when you need GPU instances, MPI parallelism across VMs, very long-running tasks (beyond 24 hours), precise VM configuration (custom machine types, local SSDs), or HPC-style workloads. Use Cloud Run Jobs for most containerised batch work: ETL pipelines, exports, reports, and scheduled data tasks. If Cloud Run Jobs can handle it, use that — the setup is simpler and the operational overhead is lower.
Can batch jobs run on a schedule?
Yes. Both Cloud Run Jobs and Cloud Batch jobs can be triggered on a schedule using Cloud Scheduler. Cloud Scheduler sends an HTTP request or Pub/Sub message to trigger the job at a defined cron interval. This is the standard approach for nightly ETL jobs, daily exports, and recurring reports.
How do parallel tasks avoid processing the same data twice?
Each task reads its index from an environment variable: CLOUD_RUN_TASK_INDEX in Cloud Run Jobs, or BATCH_TASK_INDEX in Cloud Batch. The task uses that index to select its slice of the dataset. For example, with 10 tasks and 1,000 records, splitting into contiguous ranges gives task 0 records 0-99, task 1 records 100-199, and so on. The modulo pattern (process item i if i % task_count == task_index) deals records out round-robin instead, so task 0 takes records 0, 10, 20, and so on. It works even when the total count is not evenly divisible.
Can I use Spot VMs for batch jobs in Google Cloud?
Yes. Both Cloud Run Jobs and Cloud Batch support Spot capacity, which reduces compute costs by 60–91%. Spot VMs can be preempted at any time, so always set retries (--max-retries in Cloud Run Jobs, maxRetryCount in Cloud Batch). For long-running tasks, write checkpoints to Cloud Storage so a preempted task can resume from where it left off rather than restarting from scratch.