Batch Jobs in GCP: Cloud Run Jobs vs Cloud Batch Explained

A batch job runs to completion rather than serving ongoing requests. Processing a day’s worth of data, generating a report, transforming files, evaluating a model: these are batch workloads. GCP gives you two services for this pattern. Cloud Run Jobs is the simpler option: one command to create, one to run, container-based, done. Cloud Batch handles the harder cases: GPU clusters, MPI parallelism across VMs, HPC workloads, and tasks that run for days. Both provision compute, run tasks in parallel, and release everything when the job finishes.

Simple explanation

Most cloud services wait for requests and respond to them: a web server, an API, a Cloud Function. A batch job is different. You hand it a dataset or a list of tasks, tell it how many workers to spin up, and it grinds through the work and exits when finished.

Analogy

A batch job is like a printing press run. You load the paper, press start, and it prints 10,000 copies without you asking for each one. When it is finished, it stops. No requests, no waiting, just a defined workload with a defined end.

The difference from a long-running service is intent: a batch job is designed to complete. It has a defined start, a defined workload, and a defined end. You are not keeping a server warm for the next request. You are running a finite task and shutting everything down afterwards.

Note

If your work is event-driven rather than time-triggered — reacting to a new file upload rather than running nightly — consider Cloud Tasks or event-driven architectures with Pub/Sub instead.

How batch jobs work in GCP

The lifecycle follows the same pattern whether you use Cloud Run Jobs or Cloud Batch:

Define the workload. Package your processing logic in a container image and push it to Artifact Registry. Your code should read a task index from an environment variable so each worker knows which slice of the data to process.
Set the task count and parallelism. Task count is the total number of work units. Parallelism is how many run at the same time. 100 tasks at parallelism 10 means 10 workers at a time, cycling through until all 100 are done.
Submit the job. GCP provisions compute, starts the tasks, and passes each one its index via an environment variable.
Tasks process their assigned slice. Each task reads its index, selects the records or files it owns, and processes only those.
The job finishes and resources are released. Once all tasks complete successfully, GCP tears down the compute. You pay only for what ran.
Logs and retries matter. Both services write logs to Cloud Logging. Retries handle transient failures and Spot VM preemptions. Long tasks should write checkpoints to Cloud Storage so they can resume rather than restart from scratch.
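
The relationship between task count and parallelism can be tried locally before touching GCP. The sketch below is only a simulation: it assumes a thread pool as a stand-in for the workers GCP would provision, and `run_job` and its trivial task body are hypothetical names, not part of either service.

```python
from concurrent.futures import ThreadPoolExecutor
import threading


def run_job(task_count: int, parallelism: int):
    """Run task_count tasks with at most `parallelism` in flight.

    Returns (results, peak), where peak is the highest number of
    tasks observed running at the same moment.
    """
    lock = threading.Lock()
    running = 0
    peak = 0

    def task(index: int) -> int:
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        try:
            return index  # stand-in for real work on this task's slice
        finally:
            with lock:
                running -= 1

    # The pool size plays the role of the parallelism setting.
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        results = list(pool.map(task, range(task_count)))
    return results, peak


results, peak = run_job(task_count=100, parallelism=10)
```

All 100 task indices are handled, but the pool never has more than 10 tasks in flight, which mirrors the "100 tasks at parallelism 10" example above.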

For recurring jobs, use Cloud Scheduler to trigger execution on a cron schedule: nightly at 2am, every Monday morning, or whatever cadence your workload needs.

Choosing between Cloud Run Jobs and Cloud Batch

Tip

Start with Cloud Run Jobs. It covers the vast majority of batch workloads with far less configuration. Switch to Cloud Batch only when you hit a specific requirement that Cloud Run Jobs cannot meet: full GPU instances, MPI, multi-day runtimes, or deep VM configuration.

Feature	Cloud Run Jobs	Cloud Batch
Setup complexity	Low: one gcloud command	Medium: JSON job spec required
Runtime model	Container-only (serverless-feeling)	Container or script on managed VMs
Container support	Yes, required	Yes, optional (scripts also work)
VM control	Limited	Full: machine type, local SSD, GPU
GPU support	Limited	Yes, full GPU instance support
MPI / HPC support	No	Yes: multiple tasks per VM, shared memory
Max task duration	Up to 24 hours	Unlimited (tasks can run for days)
Parallelism	Yes, configurable task count and concurrency	Yes, configurable at task group level
Spot VM support	Yes (gen2 execution environment)	Yes via SPOT provisioningModel
Scheduling	Via Cloud Scheduler	Via Cloud Scheduler
Best use cases	ETL, exports, reports, data pipelines	ML training, HPC, genomics, rendering, long transforms
Best for beginners	Yes, straightforward to get started	Steeper learning curve due to job spec format
Operational overhead	Very low	Higher: more configuration to manage

Choose Cloud Run Jobs when:

Your work fits in a container and runs in under 24 hours
You want the simplest possible setup
You are running ETL, report generation, file transforms, or data exports
You do not need GPU acceleration or MPI

Choose Cloud Batch when:

You need GPU instances for ML training or rendering
You need MPI parallelism across multiple VMs
Tasks need to run longer than 24 hours
You need fine-grained control over machine type, local SSD, or network configuration
You are running HPC, genomics, or simulation workloads

The real trade-off between the two

Analogy

Cloud Run Jobs is a managed taxi service. You say where you want to go, and it handles the car, the route, and the driving. Cloud Batch is a rental car. You get full control of the vehicle: choose the model, the fuel type, the GPS settings. That control is worth it when you have specific requirements. When you just need to get somewhere, the taxi is faster and simpler.

Cloud Run Jobs feels like a serverless tool. You define a container, say how many tasks to run, and submit. GCP handles the rest. You never configure a VM, write a job spec, or think about allocation policies. For most teams building data pipelines or automation tasks, this is exactly the right level of abstraction.

Cloud Batch is infrastructure-oriented. You write a JSON job specification that declares machine types, provisioning models, task groups, compute resources, and logging policies. That verbosity exists for good reason: it gives you the control required for GPU workloads, HPC clusters, and jobs that run for days on specialised hardware.

For beginners: start with Cloud Run Jobs. Most batch work fits comfortably within its limits. You will spend less time on configuration and more time on the actual work. See choosing between Cloud Run, GKE, and VMs for the broader compute decision picture.

When to use batch jobs in GCP

Batch processing covers a wide range of real workloads. Here are the most common patterns and which service fits each.

Good fits for Cloud Run Jobs:

Nightly ETL jobs. Extract records from a source, transform them, load into BigQuery or Cloud Storage.
Report generation. Pull data, compute summaries, write output files.
Image or document processing. Resize images, convert formats, or run OCR on documents in bulk.
Scheduled data exports. Export query results, send files to external systems on a cron schedule.
Data cleanup jobs. Deduplicate, validate, or archive old records.
Model evaluation runs. Score a dataset against a pre-trained model.

Good fits for Cloud Batch:

ML model training. GPU-accelerated training jobs that run for hours or days.
Genomics pipelines. Large-scale parallel DNA or RNA analysis.
3D rendering. Frame-by-frame rendering with many parallel workers.
HPC simulations. Tightly coupled workloads needing MPI across multiple VMs.
Large-scale data transforms. Jobs that need local SSD throughput or specialised hardware configurations.

Note

For streaming workloads where data arrives continuously rather than in discrete batches, consider Dataflow instead. Dataflow handles both batch and streaming pipelines and is better suited to continuous data processing at scale.

Cloud Run Jobs

Cloud Run Jobs extends Cloud Run for run-to-completion workloads. You package your logic in a container image, push it to Artifact Registry, and create a job with a single gcloud command. Tasks run in parallel containers, each receiving its index via environment variables, and the job completes when all tasks succeed.

This is the recommended starting point for most batch work. No JSON spec, no VM configuration, minimal operational surface.

Create and run a Cloud Run Job:

# Create a Cloud Run Job
gcloud run jobs create my-job \
  --image=us-central1-docker.pkg.dev/PROJECT_ID/my-repo/processor:v1 \
  --region=us-central1 \
  --tasks=50 \
  --parallelism=5 \
  --max-retries=3 \
  --task-timeout=3600s

Key flags:

--tasks: total number of task instances to run; each gets a unique index
--parallelism: how many tasks run simultaneously (5 here; the rest queue)
--max-retries: how many times to retry a failed or preempted task
--task-timeout: maximum duration per task before it is killed and retried

# Execute the job immediately
gcloud run jobs execute my-job --region=us-central1

# Execute and wait for completion before returning
gcloud run jobs execute my-job \
  --region=us-central1 \
  --wait

# View execution history
gcloud run jobs executions list \
  --job=my-job \
  --region=us-central1

Note

Inside your container, CLOUD_RUN_TASK_INDEX tells each task its position (0-based) and CLOUD_RUN_TASK_COUNT tells it the total number of tasks. Use both to partition the work so no two tasks process the same data.
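
One way to use both values together is to give each task a contiguous range of item positions. The sketch below is an illustration under assumptions: `contiguous_slice` is a hypothetical helper, and the total of 1,000 items is made up for the example.

```python
import os


def contiguous_slice(total_items: int, task_index: int, task_count: int) -> range:
    """Return the range of item positions owned by one task.

    The first (total_items % task_count) tasks each take one extra item,
    so slices stay balanced when the total does not divide evenly.
    """
    if task_count < 1 or not 0 <= task_index < task_count:
        raise ValueError("task_index must satisfy 0 <= task_index < task_count")
    base, extra = divmod(total_items, task_count)
    start = task_index * base + min(task_index, extra)
    stop = start + base + (1 if task_index < extra else 0)
    return range(start, stop)


# Inside a Cloud Run Job these variables are set for each task.
# The defaults let the same script run as a single task on a laptop.
task_index = int(os.environ.get("CLOUD_RUN_TASK_INDEX", "0"))
task_count = int(os.environ.get("CLOUD_RUN_TASK_COUNT", "1"))
my_positions = contiguous_slice(1000, task_index, task_count)
```

With 10 tasks and 1,000 items, task 0 owns positions 0 to 99 and task 9 owns 900 to 999. Contiguous ranges suit inputs that are cheap to seek into, such as sorted file lists or row offsets.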

Cloud Batch

Cloud Batch exists for workloads that need more than Cloud Run Jobs can offer: GPU instances, MPI-coupled tasks, jobs that run for days, or precise control over machine type and storage. It manages the underlying VMs for you, but you configure everything through a JSON job specification.

Tip

If you are coming from HPC, scientific computing, or ML training pipelines, Cloud Batch will feel familiar. If this is your first batch workload in GCP, start with Cloud Run Jobs and come back to Cloud Batch only when you hit a requirement that Cloud Run Jobs cannot address.

# Enable the Batch API first
gcloud services enable batch.googleapis.com

Define the job in a JSON file:

{
  "taskGroups": [
    {
      "taskSpec": {
        "runnables": [
          {
            "container": {
              "imageUri": "us-central1-docker.pkg.dev/PROJECT_ID/my-repo/processor:v1",
              "entrypoint": "/bin/sh",
              "commands": ["-c", "python process.py --task-index ${BATCH_TASK_INDEX}"]
            }
          }
        ],
        "computeResource": {
          "cpuMilli": 2000,
          "memoryMib": 4096
        },
        "maxRetryCount": 3,
        "maxRunDuration": "3600s"
      },
      "taskCount": 100,
      "parallelism": 10
    }
  ],
  "allocationPolicy": {
    "instances": [
      {
        "policy": {
          "machineType": "e2-standard-4",
          "provisioningModel": "SPOT"
        }
      }
    ]
  },
  "logsPolicy": {
    "destination": "CLOUD_LOGGING"
  }
}

Key fields in the job spec:

taskGroups: one or more groups of tasks; each group shares the same task spec
runnables: the container (or script) that each task executes
computeResource: CPU and memory allocated to each task
maxRetryCount: retries per task on failure or preemption
maxRunDuration: task timeout; tasks running longer than this are killed and retried
taskCount / parallelism: total tasks and how many run at the same time
allocationPolicy: the VM type and provisioning model (SPOT reduces cost significantly)
logsPolicy: where task logs go (CLOUD_LOGGING sends them to Cloud Logging)
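
When several jobs differ only in a few values, the spec can be generated rather than hand-edited. The following is a minimal sketch, assuming the same fields as the job.json above; `build_job_spec` is a hypothetical helper, the check that parallelism does not exceed the task count is a guard chosen for this example rather than a documented rule, and the container is assumed to read BATCH_TASK_INDEX from its environment instead of receiving it as an argument.

```python
import json


def build_job_spec(image_uri, task_count, parallelism, *,
                   machine_type="e2-standard-4", spot=True,
                   cpu_milli=2000, memory_mib=4096,
                   max_retries=3, max_run_seconds=3600):
    """Build a Cloud Batch job spec as a plain dict, mirroring job.json."""
    if not 1 <= parallelism <= task_count:
        raise ValueError("parallelism must be between 1 and task_count")
    return {
        "taskGroups": [{
            "taskSpec": {
                # No "commands" here: the image's default entrypoint runs.
                "runnables": [{"container": {"imageUri": image_uri}}],
                "computeResource": {"cpuMilli": cpu_milli, "memoryMib": memory_mib},
                "maxRetryCount": max_retries,
                "maxRunDuration": f"{max_run_seconds}s",
            },
            "taskCount": task_count,
            "parallelism": parallelism,
        }],
        "allocationPolicy": {"instances": [{"policy": {
            "machineType": machine_type,
            "provisioningModel": "SPOT" if spot else "STANDARD",
        }}]},
        "logsPolicy": {"destination": "CLOUD_LOGGING"},
    }


spec = build_job_spec(
    "us-central1-docker.pkg.dev/PROJECT_ID/my-repo/processor:v1",
    task_count=100,
    parallelism=10,
)
job_json = json.dumps(spec, indent=2)  # write this string to job.json
```

Keeping the spec as a dict until the last step makes it easy to assert on individual fields in a unit test before anything is submitted.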

# Submit the job
gcloud batch jobs submit my-batch-job \
  --location=us-central1 \
  --config=job.json

# Check job status
gcloud batch jobs describe my-batch-job --location=us-central1

# List all jobs in a region
gcloud batch jobs list --location=us-central1

How task partitioning works

Parallelism only helps if each task processes a different slice of the data. Without partitioning, every task processes the full dataset, producing duplicate results and throwing away the benefit of running multiple workers at once.

Warning

Running 50 parallel tasks that each fetch and process all 50,000 records produces 50 duplicate outputs. This is the single most common mistake in batch job design. Always use the task index to partition work before writing any other logic.

The solution is to use the task index. GCP injects the index into each task as an environment variable:

Cloud Run Jobs: CLOUD_RUN_TASK_INDEX and CLOUD_RUN_TASK_COUNT
Cloud Batch: BATCH_TASK_INDEX and BATCH_TASK_COUNT

Your code reads these values and uses them to select only the items it is responsible for. A simple modulo pattern handles this cleanly even when the total item count does not divide evenly:

import os

task_index = int(os.environ.get("CLOUD_RUN_TASK_INDEX", 0))
task_count = int(os.environ.get("CLOUD_RUN_TASK_COUNT", 1))

# fetch_all_items() and process() are placeholders for your own data access and work functions.
# Fetch the full list once, then select only this task's slice
all_items = fetch_all_items()
my_items = [item for i, item in enumerate(all_items) if i % task_count == task_index]

for item in my_items:
    process(item)

Analogy

Processing 1,000 files with 10 tasks is like dealing a deck of 1,000 cards evenly across 10 players. Each player gets exactly 100 cards — their assigned slice — and plays only those. The task index is which player you are. No card is played twice, every card is covered.

For Cloud Batch, replace CLOUD_RUN_TASK_INDEX with BATCH_TASK_INDEX and CLOUD_RUN_TASK_COUNT with BATCH_TASK_COUNT. The pattern is identical.
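
If the same image may run under either service, the lookup can be made service-agnostic. This is a sketch under assumptions: `task_identity` and `modulo_slice` are hypothetical helper names, and the fallback to a single task when neither variable is set is a convenience for local runs.

```python
import os
from typing import Mapping, Tuple


def task_identity(env: Mapping[str, str]) -> Tuple[int, int]:
    """Return (task_index, task_count) from whichever service set them."""
    for prefix in ("CLOUD_RUN_TASK", "BATCH_TASK"):
        index = env.get(f"{prefix}_INDEX")
        if index is not None:
            return int(index), int(env.get(f"{prefix}_COUNT", "1"))
    return 0, 1  # neither service detected: behave as a single local task


def modulo_slice(items, task_index, task_count):
    """Select the items whose position modulo task_count equals task_index."""
    return [item for i, item in enumerate(items) if i % task_count == task_index]


task_index, task_count = task_identity(os.environ)

# Sanity check of the pattern: across all tasks, every item appears exactly once,
# even though 1,003 does not divide evenly by 10.
items = list(range(1003))
slices = [modulo_slice(items, t, 10) for t in range(10)]
all_assigned = sorted(x for s in slices for x in s)
```

Here `all_assigned` equals the original item list, which is the property that rules out both duplicates and gaps.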

Using Spot VMs to reduce cost

Spot VMs reduce batch job compute costs by 60–91% compared to on-demand pricing. Both Cloud Run Jobs and Cloud Batch support Spot capacity. The trade-off is that Spot VMs can be preempted at any time, so your job design must account for interruptions.

Two things are required for safe Spot-based batch jobs:

Retries so preempted tasks restart automatically rather than failing permanently.
Checkpoints for long tasks: write progress to Cloud Storage so a restarted task resumes from the last saved point rather than from scratch.
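
The checkpoint idea can be sketched in a few lines. This example makes several assumptions: it writes the checkpoint to a local file as a stand-in for a Cloud Storage object, `process_with_checkpoint` and `Preempted` are hypothetical names, and the preemption is simulated by raising an exception partway through.

```python
import json
import tempfile
from pathlib import Path


class Preempted(Exception):
    """Simulates the task being interrupted mid-run."""


def process_with_checkpoint(items, checkpoint_path, handle, save_every=100):
    """Process items in order, resuming after the last saved position.

    Returns the position the run started from (0 on a fresh start).
    """
    start = 0
    if checkpoint_path.exists():
        start = json.loads(checkpoint_path.read_text())["next_index"]
    for i in range(start, len(items)):
        handle(items[i])
        if (i + 1) % save_every == 0 or i + 1 == len(items):
            # Write to a temporary file, then rename, so a crash
            # mid-write never leaves a half-written checkpoint.
            tmp = checkpoint_path.with_suffix(".tmp")
            tmp.write_text(json.dumps({"next_index": i + 1}))
            tmp.replace(checkpoint_path)
    return start


def demo():
    items = list(range(1000))
    processed = []
    state = {"preempt": True}

    def handle(item):
        if item == 250 and state["preempt"]:
            state["preempt"] = False
            raise Preempted()
        processed.append(item)

    with tempfile.TemporaryDirectory() as tmp_dir:
        checkpoint = Path(tmp_dir) / "task-0.json"
        try:
            process_with_checkpoint(items, checkpoint, handle)
        except Preempted:
            pass  # the retry below plays the role of the automatic restart
        resumed_from = process_with_checkpoint(items, checkpoint, handle)
    return resumed_from, processed


resumed_from, processed = demo()
```

The retry resumes from position 200, the last saved checkpoint, not from 250 where the interruption happened. Items 200 to 249 are therefore handled twice, so the work done per item should be safe to repeat.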

Cloud Run Job on Spot VMs (requires the gen2 execution environment):

gcloud run jobs create my-spot-job \
  --image=IMAGE \
  --region=us-central1 \
  --tasks=100 \
  --parallelism=10 \
  --max-retries=3 \
  --execution-environment=gen2

Cloud Batch Spot configuration (in the allocationPolicy section of job.json):

{
  "allocationPolicy": {
    "instances": [
      {
        "policy": {
          "machineType": "e2-standard-4",
          "provisioningModel": "SPOT"
        }
      }
    ]
  }
}

Cost considerations

Both services can use Spot capacity for significant savings, but the cheapest option is not always the simplest to operate.

Spot savings require retries. Without retry configuration, a single preemption fails the task permanently. Always set retries when using Spot.
Long tasks need checkpointing. A 6-hour Spot task preempted at hour 5 with no checkpointing restarts from zero. Write intermediate results to Cloud Storage at regular intervals.
Cloud Run Jobs reduces operational overhead for smaller teams. Less configuration means fewer things to maintain, debug, and audit. For most workloads, this matters more than squeezing the last percentage point of cost savings.
Cloud Batch is better value for specialised compute. If you genuinely need GPU instances or large-memory VMs, Cloud Batch’s direct VM control and Spot support at that tier can produce better economics.
Parallelism affects time-to-completion, not total cost. Running 100 tasks at parallelism 100 costs the same total compute as parallelism 10. Higher parallelism finishes faster but does not reduce cost.
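
The last point is simple arithmetic, worked through below. The sketch assumes every task takes the same time and ignores startup overhead; `job_estimate` and the 6-minute task duration are made up for illustration.

```python
import math


def job_estimate(task_count: int, parallelism: int, minutes_per_task: float):
    """Return (wall_clock_minutes, total_task_minutes) for a job.

    Wall-clock time depends on how many rounds of tasks are needed;
    total task-minutes, which is what drives compute cost, does not.
    """
    rounds = math.ceil(task_count / parallelism)
    wall_clock = rounds * minutes_per_task
    total = task_count * minutes_per_task
    return wall_clock, total


narrow = job_estimate(task_count=100, parallelism=10, minutes_per_task=6)
wide = job_estimate(task_count=100, parallelism=100, minutes_per_task=6)
```

At parallelism 10 the job takes 60 minutes of wall-clock time; at parallelism 100 it takes 6. Both consume 600 task-minutes of compute.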

See cost optimisation strategies in GCP for the broader picture on managing cloud spend.

How to trigger and monitor jobs

Manual execution:

# Trigger a Cloud Run Job immediately
gcloud run jobs execute my-job --region=us-central1 --wait

# Trigger a Cloud Batch job
gcloud batch jobs submit my-batch-job \
  --location=us-central1 \
  --config=job.json

Scheduled execution uses Cloud Scheduler to run jobs on a cron schedule. Cloud Scheduler can call the Cloud Run Jobs execute API directly, or publish a message to Pub/Sub that triggers a downstream process to submit a Cloud Batch job.

Viewing logs:

# Logs for a Cloud Run Job execution
gcloud logging read \
  'resource.type="cloud_run_job" AND resource.labels.job_name="my-job"' \
  --limit=100 \
  --format="value(textPayload)"

# Logs for a Cloud Batch job
gcloud logging read \
  'resource.type="batch.googleapis.com/Job" AND labels."batch.googleapis.com/job_name"="my-batch-job"' \
  --limit=100 \
  --format="value(textPayload)"

Both services write structured logs to Cloud Logging. Use the Logs Explorer to filter by job name, task index, or severity. For persistent visibility, set up log-based metrics and alerts in Cloud Monitoring to notify you when a job fails or takes longer than expected.

Checking job status:

# Cloud Run Job execution status
gcloud run jobs executions describe EXECUTION_NAME --region=us-central1

# Cloud Batch job status
gcloud batch jobs describe my-batch-job --location=us-central1

Common mistakes

Every task processes the full dataset. 50 parallel tasks that each process all 50,000 records produce 50 duplicate outputs and waste 49x the compute. Read CLOUD_RUN_TASK_INDEX or BATCH_TASK_INDEX and use it to select only that task’s assigned slice.
No retries for Spot-based jobs. Spot VMs can be preempted at any point. With retries at 0, a preempted task fails permanently. Set --max-retries (Cloud Run Jobs) or maxRetryCount (Cloud Batch) to at least 2 or 3.
No checkpointing for long-running Spot tasks. A task preempted at hour 5 of a 6-hour run with no checkpointing restarts from the beginning. Write progress to Cloud Storage at regular intervals and check for an existing checkpoint at startup.
No task timeout set. A task stuck in an infinite loop or waiting on a hung external call will run until you cancel it manually. Set maxRunDuration (Cloud Batch) or --task-timeout (Cloud Run Jobs) so stuck tasks are killed and retried automatically.
Choosing Cloud Batch when Cloud Run Jobs would be simpler. Cloud Batch requires a JSON job spec with allocation policies, task groups, and compute resource definitions. For a standard ETL job or report generator, this adds complexity with no benefit. Use Cloud Run Jobs unless you genuinely need what Cloud Batch offers.
Choosing Cloud Run Jobs when full GPU or MPI support is needed. If your workload requires full GPU instance support or tight MPI coupling across VMs, Cloud Run Jobs is the wrong fit: its GPU support is limited and it has no MPI support. Recognise this requirement early and reach for Cloud Batch before building around the wrong tool.

Frequently asked questions

What is a batch job in GCP?

A batch job is a unit of work that runs to completion rather than continuously serving requests. Examples include processing overnight data, generating reports, transforming files, or running a model evaluation. In GCP, the two main services for running batch jobs are Cloud Run Jobs (simpler, container-first) and Cloud Batch (VM-level control, GPU support, HPC use cases).

When should I use Cloud Batch instead of Cloud Run Jobs?

Use Cloud Batch when you need GPU instances, MPI parallelism across VMs, very long-running tasks (beyond 24 hours), precise VM configuration (custom machine types, local SSDs), or HPC-style workloads. Use Cloud Run Jobs for most containerised batch work: ETL pipelines, exports, reports, and scheduled data tasks. If Cloud Run Jobs can handle it, use that — the setup is simpler and the operational overhead is lower.

Can batch jobs run on a schedule?

Yes. Both Cloud Run Jobs and Cloud Batch jobs can be triggered on a schedule using Cloud Scheduler. Cloud Scheduler sends an HTTP request or Pub/Sub message to trigger the job at a defined cron interval. This is the standard approach for nightly ETL jobs, daily exports, and recurring reports.

How do parallel tasks avoid processing the same data twice?

Each task reads its index from an environment variable: CLOUD_RUN_TASK_INDEX in Cloud Run Jobs, or BATCH_TASK_INDEX in Cloud Batch. The task uses that index to select its slice of the dataset. For example, with 10 tasks and 1,000 records, the modulo pattern (process item i if i % task_count == task_index) gives task 0 records 0, 10, 20, and so on, and task 1 records 1, 11, 21, and so on. Contiguous ranges also work: task 0 processes records 0-99, task 1 processes 100-199, and so on. The modulo pattern needs no special handling when the total count is not evenly divisible.

Can I use Spot VMs for batch jobs in Google Cloud?

Yes. Both Cloud Run Jobs and Cloud Batch support Spot capacity, which reduces compute costs by 60–91%. Spot VMs can be preempted at any time, so always set retries (--max-retries in Cloud Run Jobs, maxRetryCount in Cloud Batch). For long-running tasks, write checkpoints to Cloud Storage so a preempted task can resume from where it left off rather than restarting from scratch.

Last verified: 22 March 2026. Cloud services change frequently. Verify details against official documentation before making infrastructure decisions.
