How to Run Spark on GCP: Dataproc vs Dataproc Serverless
Running Spark in GCP means submitting PySpark or Spark jobs to Dataproc, Google Cloud’s managed Spark service. You have two options: create a Dataproc cluster and submit jobs to it, or skip the cluster entirely with Dataproc Serverless. Both run the same Spark code and read from Cloud Storage and BigQuery. The fast rule of thumb: use Serverless for jobs that run a few times a day or less, and use a managed cluster when you run many jobs per hour and want to avoid repeated provisioning overhead.
Simple explanation#
Apache Spark is an engine for processing large datasets across many machines in parallel. Instead of running a script on one computer, Spark splits the work across a cluster of machines so it finishes faster.
Dataproc is Google Cloud’s managed version of Spark. You tell GCP how many machines you want, Dataproc sets up the cluster with Spark installed, and you submit jobs to it. When you are done, you delete the cluster. Think of it like renting a fully equipped workshop: you choose the size, use the tools, and return the keys when the job is finished.
Dataproc Serverless removes the workshop entirely. You hand GCP your script, GCP finds the right number of machines, runs the job, and cleans up. You never see or manage the cluster. Think of it like hiring a contractor: describe the work, get the result, and pay only for the time it took.
Both options run the same PySpark code. The difference is how much infrastructure you manage.
Dataproc is like renting a food truck: you pick the size, stock the kitchen, cook the meals, and return it when you are done. Dataproc Serverless is like ordering from a catering app: you describe what you need, someone else handles all the equipment, and you just get the finished plates.
How it works#
The end-to-end flow for running Spark on GCP follows a consistent pattern regardless of which option you choose:
1. Write your Spark code. A PySpark script or Spark SQL query that reads, transforms, and writes data.
2. Upload the script to Cloud Storage. Dataproc workers pull the script from a gs:// path.
3. Store input data in Cloud Storage or BigQuery. Workers access data over Google’s internal network.
4. Submit the job. Either to a managed cluster (gcloud dataproc jobs submit) or to Serverless (gcloud dataproc batches submit).
5. Spark processes the data. The cluster or Serverless runtime distributes work across workers.
6. Output lands in Cloud Storage or BigQuery. Results persist after the job ends.
7. Monitor in Cloud Console. Job logs, Spark UI, and metrics are available through Cloud Logging and Dataproc’s monitoring interface.
The Spark code is identical in both paths. What changes is step 4: with a managed cluster you create the cluster first and submit to it, while with Serverless you submit directly and GCP handles provisioning.
Both paths produce the same results for the same Spark code. Choosing between them is an operational decision, not a code decision. You can switch from one to the other without rewriting your PySpark scripts.
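To make the "only the submit step differs" point concrete, here is a minimal Python sketch that assembles the gcloud arguments for either path. The helper name build_submit_command is hypothetical; the flags and example names (my-project, my-spark-cluster, the transform.py path) are the ones used elsewhere in this guide. It only builds the argument list and does not call gcloud.

```python
from typing import List, Optional


def build_submit_command(
    script_uri: str,
    region: str,
    project: str,
    cluster: Optional[str] = None,
) -> List[str]:
    """Return the gcloud argv for submitting a PySpark script.

    With a cluster name the job targets that managed cluster;
    without one it is submitted as a Serverless batch.
    """
    if not script_uri.startswith("gs://"):
        raise ValueError(f"script must be a Cloud Storage URI, got {script_uri!r}")

    if cluster is not None:
        # Managed cluster path: the cluster must already exist.
        command = ["gcloud", "dataproc", "jobs", "submit", "pyspark",
                   script_uri, f"--cluster={cluster}"]
    else:
        # Serverless path: no cluster flag at all.
        command = ["gcloud", "dataproc", "batches", "submit", "pyspark", script_uri]

    return command + [f"--region={region}", f"--project={project}"]


SCRIPT = "gs://my-project-spark-data/scripts/transform.py"
cluster_cmd = build_submit_command(SCRIPT, "us-central1", "my-project",
                                   cluster="my-spark-cluster")
serverless_cmd = build_submit_command(SCRIPT, "us-central1", "my-project")

print(" ".join(cluster_cmd))
print(" ".join(serverless_cmd))
```

The script URI is the same in both commands, which is why switching paths does not require touching the PySpark code.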
Prerequisites#
Before running your first Spark job, make sure you have:
- A GCP project with billing enabled. Dataproc uses Compute Engine VMs, which incur charges. See GCP pricing models for how compute billing works.
- The Dataproc API enabled. Enable it through the Cloud Console or the gcloud CLI:
gcloud services enable dataproc.googleapis.com --project=my-project
- gcloud CLI installed and authenticated. See the gcloud CLI guide if you have not set this up yet.
- Appropriate IAM permissions. You need at minimum the roles/dataproc.editor role to create clusters and submit jobs. See the IAM overview for how roles work.
- A Cloud Storage bucket for scripts, input data, staging files, and output. Both Dataproc and Dataproc Serverless need gs:// paths for everything.
gcloud storage buckets create gs://my-project-spark-data --location=us-central1
Option 1: Dataproc managed clusters#
A Dataproc managed cluster is a set of Compute Engine VMs with Spark pre-installed and configured. You control the cluster size, machine types, and lifecycle.
When to use a managed cluster#
- You run many jobs per hour and want to avoid repeated provisioning overhead.
- You need interactive Spark sessions via Jupyter notebooks or Spark shell.
- You need fine-grained control over Spark configuration, custom initialization scripts, or specific Spark component versions.
- Your workload uses Spark Streaming or long-running processes.
Create a cluster#
# Create a single-node cluster for development and testing
gcloud dataproc clusters create my-spark-cluster \
--region=us-central1 \
--single-node \
--project=my-project
The --single-node flag creates a single-VM cluster with no separate workers. This is the cheapest option for development, testing, and small exploratory jobs. For production workloads, omit --single-node and specify worker count and machine types:
# Create a multi-node cluster for production workloads
gcloud dataproc clusters create my-spark-cluster \
--region=us-central1 \
--num-workers=4 \
--master-machine-type=n2-standard-4 \
--worker-machine-type=n2-standard-8 \
--project=my-project
Submit a job#
gcloud dataproc jobs submit pyspark gs://my-project-spark-data/scripts/transform.py \
--cluster=my-spark-cluster \
--region=us-central1 \
--project=my-project
Cluster lifecycle and cost#
A managed cluster costs money for every second it exists, whether or not any jobs are running. A four-node cluster left running overnight accumulates a full night of VM charges.
The standard pattern is ephemeral clusters: create the cluster, run your jobs, delete the cluster. Dataproc clusters provision in 60 to 90 seconds, fast enough for most batch workflows.
# Delete the cluster when done
gcloud dataproc clusters delete my-spark-cluster \
--region=us-central1 \
--project=my-project \
--quiet
Script cluster deletion as the final step of every batch workflow.
A running cluster costs money whether or not it is doing useful work. A four-node cluster forgotten over a weekend accumulates roughly 48 hours of VM charges per node. Always automate cluster deletion as the last step in your job script.
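One way to guarantee deletion even when a job fails is to wrap the submissions in try/finally. The sketch below is a minimal illustration, not a production orchestrator: the function name run_on_ephemeral_cluster and the injectable run parameter are hypothetical, and the gcloud flags are the ones shown in this section. The demo at the bottom passes a recording function instead of calling gcloud, so it runs anywhere.

```python
import subprocess
from typing import Callable, List, Sequence

Runner = Callable[[List[str]], None]


def run_checked(argv: List[str]) -> None:
    """Default runner: execute the command and raise if it fails."""
    subprocess.run(argv, check=True)


def run_on_ephemeral_cluster(
    cluster: str,
    region: str,
    project: str,
    script_uris: Sequence[str],
    run: Runner = run_checked,
) -> None:
    """Create a cluster, submit each script, and always delete the cluster."""
    common = [f"--region={region}", f"--project={project}"]

    # If creation itself fails, nothing below runs; inspect the cluster
    # manually in that case.
    run(["gcloud", "dataproc", "clusters", "create", cluster, "--single-node"] + common)
    try:
        for uri in script_uris:
            run(["gcloud", "dataproc", "jobs", "submit", "pyspark", uri,
                 f"--cluster={cluster}"] + common)
    finally:
        # Runs whether the jobs succeeded or raised.
        run(["gcloud", "dataproc", "clusters", "delete", cluster, "--quiet"] + common)


# Dry run: record the commands instead of executing them.
recorded: List[List[str]] = []
run_on_ephemeral_cluster(
    "my-spark-cluster",
    "us-central1",
    "my-project",
    ["gs://my-project-spark-data/scripts/transform.py"],
    run=recorded.append,
)
for argv in recorded:
    print(" ".join(argv))
```

Dropping the run argument makes the function call gcloud for real, so try it first with the recorder as shown.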
Custom dependencies and Spark properties#
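Install Python packages at cluster creation by pairing the PIP_PACKAGES metadata key with the pip-install initialization action, which is the script that reads that key (package specs are space-separated):
gcloud dataproc clusters create my-spark-cluster \
--region=us-central1 \
--single-node \
--initialization-actions=gs://goog-dataproc-initialization-actions-us-central1/python/pip-install.sh \
--metadata=PIP_PACKAGES="pandas==2.2.0 scikit-learn" \
--project=my-project
Override Spark configuration with --properties: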
gcloud dataproc jobs submit pyspark gs://my-project-spark-data/scripts/transform.py \
--cluster=my-spark-cluster \
--region=us-central1 \
--properties=spark.executor.memory=8g,spark.driver.memory=4g \
--project=my-project
Option 2: Dataproc Serverless#
Dataproc Serverless removes cluster management entirely. Submit a batch job directly and GCP provisions compute, runs the job, and releases resources when it finishes. No cluster to create, size, monitor, or delete.
When to use Serverless#
- Your jobs run a few times a day or less.
- You want zero operational overhead for cluster management.
- You want to pay only for the compute time your job actually uses.
- You are running standard PySpark batch jobs that do not need custom cluster configuration.
Submit a Serverless batch job#
gcloud dataproc batches submit pyspark gs://my-project-spark-data/scripts/transform.py \
--region=us-central1 \
--project=my-project
That is the entire workflow. No cluster creation, no cluster deletion.
Serverless is an excellent starting point if you are new to Spark on GCP. You can focus on writing correct PySpark code without worrying about cluster sizing or cleanup. Move to managed clusters later if your workload pattern demands it.
How Serverless runs your job#
When you submit a batch, GCP allocates a temporary Spark runtime environment behind the scenes. Your script runs, output is written to Cloud Storage or BigQuery, and the environment is torn down. The entire provisioning step typically takes one to two minutes before execution begins.
Cost behavior#
Dataproc Serverless bills for the resources a job consumes while it runs: compute, metered per second in Data Compute Units derived from the vCPUs and memory in use, plus shuffle storage. There is no charge for idle time because there is no persistent cluster. For infrequent jobs, this is almost always cheaper than maintaining a cluster. For workloads that submit dozens of jobs per hour, the repeated provisioning overhead and per-job billing may make a persistent cluster more cost-effective. Measure your specific workload pattern to decide.
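The trade-off can be sketched as simple arithmetic. Every rate below is a made-up placeholder, not a GCP price, and the model is deliberately crude: it compares an always-on cluster against per-job Serverless billing and ignores ephemeral clusters, storage, and network costs. Substitute numbers from your own billing data before drawing conclusions.

```python
def always_on_cluster_cost(hours_up: float, node_count: int,
                           node_hourly_rate: float) -> float:
    """Cost of a cluster that is billed for every hour it exists."""
    return hours_up * node_count * node_hourly_rate


def serverless_cost(job_count: int, billed_minutes_per_job: float,
                    rate_per_billed_minute: float) -> float:
    """Cost when each job is billed only for the minutes it runs."""
    return job_count * billed_minutes_per_job * rate_per_billed_minute


# Hypothetical inputs, chosen only to illustrate the shape of the curve.
NODE_HOURLY_RATE = 0.40        # placeholder cost per node per hour
RATE_PER_BILLED_MINUTE = 0.05  # placeholder Serverless cost per job-minute
BILLED_MINUTES_PER_JOB = 5.0   # placeholder job duration

daily_cluster = always_on_cluster_cost(24, 4, NODE_HOURLY_RATE)

for jobs_per_day in (3, 240):
    daily_serverless = serverless_cost(
        jobs_per_day, BILLED_MINUTES_PER_JOB, RATE_PER_BILLED_MINUTE
    )
    cheaper = "Serverless" if daily_serverless < daily_cluster else "cluster"
    print(f"{jobs_per_day:>3} jobs/day: cluster {daily_cluster:.2f}, "
          f"serverless {daily_serverless:.2f} -> {cheaper}")
```

With these placeholder inputs the cluster cost is flat while the Serverless cost grows with job count, so there is a crossover point; where it falls for a real workload depends entirely on measured rates and durations.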
Serverless capabilities#
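Dataproc Serverless supports PySpark and Spark SQL batch jobs, and also supports interactive sessions through Jupyter notebooks integrated with Vertex AI Workbench. For custom Python packages, use a custom container image (--container-image), or ship your own Python modules alongside the script with --py-files. The --deps-bucket flag does not install packages: it names the Cloud Storage bucket where gcloud stages local dependency files before the batch starts.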
When to use this#
Running Spark on GCP makes sense when you need distributed data processing and want to use the Spark API. Common use cases:
- Batch ETL. Transform raw data in Cloud Storage into clean, structured formats on a schedule. See ETL vs ELT for how this fits into broader data pipeline strategies.
- Periodic transformations. Aggregate, filter, or reshape data daily or hourly as part of a data pipeline.
- Large-scale PySpark data preparation. Prepare training datasets for ML models, deduplicate records, or join large tables.
- Lake-to-warehouse processing. Read raw files from a data lake in Cloud Storage, transform them, and write results to BigQuery for analytics.
- Exploratory data analysis. Use a single-node cluster or Serverless interactive session to explore datasets with PySpark before building a production pipeline.
- Scheduled production jobs. Run repeat jobs on a cron schedule using Cloud Scheduler, Cloud Composer, or Workflows to trigger Dataproc submissions.
Dataproc vs Dataproc Serverless#
| | Dataproc Clusters | Dataproc Serverless |
|---|---|---|
| Operational overhead | You manage cluster creation, sizing, and deletion | No cluster to manage |
| Startup time | 60–90 seconds for cluster creation | 1–2 minutes per batch job |
| Flexibility | Full control over Spark config, machine types, initialization scripts | Limited configuration options |
| Best for | Frequent jobs, interactive work, Spark Streaming | Infrequent batch jobs, zero-ops workflows |
| Cost shape | Pay for cluster uptime (idle or busy) | Pay per job execution time only |
| Interactive notebooks | Yes, Jupyter on cluster | Yes, via Vertex AI Workbench integration |
| Custom dependencies | Install at cluster creation via metadata and init actions | Custom container images, or modules passed with --py-files |
Think of Dataproc clusters like owning a delivery van. You pay for it whether you are making deliveries or not, but if you have 50 deliveries a day it is far cheaper than calling a courier each time. Dataproc Serverless is the courier: perfect for a few deliveries a week, but the per-trip cost adds up at high volume.
When Spark is not the right fit#
Not every data processing job on GCP needs Spark. Dataflow (based on Apache Beam) is fully serverless with no cluster lifecycle at all, and is a better fit for new streaming pipelines or teams that do not already use Spark. BigQuery can handle many transformation tasks directly with SQL, avoiding the need for a separate processing engine. Choose Spark on Dataproc when you have existing Spark code, need Spark-specific APIs (MLlib, GraphX, Structured Streaming), or prefer the Spark DataFrame API.
If your transformation logic is pure SQL, try running it directly in BigQuery first. BigQuery handles massive datasets without any infrastructure management, and you only pay for the data scanned. Reach for Spark when you need programmatic transformations, ML libraries, or processing logic that goes beyond what SQL can express cleanly.
Working with Cloud Storage#
All file paths in Dataproc jobs must use Cloud Storage gs:// URIs. Local paths like /home/user/data.csv do not exist on cluster worker VMs. Cloud Storage is persistent, accessible from any cluster or Serverless runtime, and survives cluster deletion.
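A cheap safeguard is to check paths before submitting a job. The following is a minimal sketch using only the standard library; the helper names is_gcs_uri and find_non_gcs_paths are hypothetical, and the check covers file paths only (BigQuery table references are handled separately by the connector).

```python
from typing import Iterable, List
from urllib.parse import urlparse


def is_gcs_uri(path: str) -> bool:
    """True when the path has the gs:// scheme and names a bucket."""
    parsed = urlparse(path)
    return parsed.scheme == "gs" and bool(parsed.netloc)


def find_non_gcs_paths(paths: Iterable[str]) -> List[str]:
    """Return the paths that a Dataproc worker would not be able to read."""
    return [p for p in paths if not is_gcs_uri(p)]


job_paths = [
    "gs://my-project-spark-data/raw/events/",
    "gs://my-project-spark-data/processed/user-event-counts/",
    "/home/user/data.csv",  # local path: exists only on your machine
]

bad_paths = find_non_gcs_paths(job_paths)
print(bad_paths)  # → ['/home/user/data.csv']
```

Running a check like this in the submission script surfaces a bad path before any compute is provisioned.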
Reading and writing files#
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cloud-storage-example").getOrCreate()

# Read Parquet files from Cloud Storage
df = spark.read.parquet("gs://my-project-spark-data/raw/events/")

# Apply transformations
result = (
    df
    .filter(df.status == "completed")
    .groupBy("user_id")
    .count()
    .withColumnRenamed("count", "completed_event_count")
)

# Write results back to Cloud Storage
result.write.parquet("gs://my-project-spark-data/processed/user-event-counts/")

spark.stop()
Upload your script to Cloud Storage before submitting:
gcloud storage cp transform.py gs://my-project-spark-data/scripts/
All cluster workers access Cloud Storage directly over Google’s internal network at high throughput. There is no need to copy data to local disk first.
Spark supports Parquet, CSV, JSON, ORC, Avro, and other file formats on Cloud Storage. Parquet is the recommended default for most workloads because it is columnar, compressed, and allows Spark to read only the columns your query needs.
Reading and writing BigQuery from Spark#
Recent Dataproc cluster images (image version 2.1 and later) ship with the BigQuery Spark connector pre-installed; on older images you need to supply the connector JAR yourself. The connector lets Spark jobs read BigQuery tables as DataFrames and write DataFrames back to BigQuery without manually exporting data to Cloud Storage first.
Reading from BigQuery#
The connector uses the BigQuery Storage API, which reads column data in parallel across workers for high throughput:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("bq-read-example").getOrCreate()
# Read a BigQuery table into a Spark DataFrame
df = spark.read.format("bigquery") \
.option("table", "my-project.analytics.events") \
.load()
result = df.filter(df.status == "completed").groupBy("user_id").count()
Writing to BigQuery#
Writing requires a temporary Cloud Storage bucket where the connector stages data before loading it into BigQuery:
# Write results to a BigQuery table
result.write.format("bigquery") \
.option("table", "my-project.analytics.user_event_counts") \
.option("temporaryGcsBucket", "my-project-spark-temp") \
.mode("overwrite") \
.save()
spark.stop()
The connector is included in Dataproc cluster images by default. If you need a specific connector version for compatibility reasons, you can pin it using the spark.jars.packages Spark property, but this is rarely necessary for standard workloads.
The temporaryGcsBucket must already exist and be in the same region as your BigQuery dataset. Create a dedicated temp bucket for this purpose and add a lifecycle rule to auto-delete objects after a few days so staging files do not accumulate.
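The lifecycle rule mentioned above is a small JSON document. This sketch generates one with the standard library; the function name delete_after_days_policy is hypothetical, and the three-day age is an assumed value for "a few days". The structure follows the Cloud Storage lifecycle configuration format (a rule list with a Delete action and an age condition in days).

```python
import json


def delete_after_days_policy(days: int) -> dict:
    """Build a lifecycle configuration that deletes objects older than `days`."""
    if days < 1:
        raise ValueError("days must be at least 1")
    return {
        "rule": [
            {
                "action": {"type": "Delete"},
                "condition": {"age": days},  # object age in days
            }
        ]
    }


# Assumed retention of 3 days for connector staging files.
policy_json = json.dumps(delete_after_days_policy(3), indent=2)
print(policy_json)
```

Saved to a file such as lifecycle.json, a policy like this can typically be applied with gcloud storage buckets update gs://my-project-spark-temp --lifecycle-file=lifecycle.json; confirm the flag against your installed gcloud version.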
Common beginner mistakes#
- Using local file paths instead of gs:// paths. A path like /home/user/data.csv exists only on your local machine. Dataproc workers cannot access it. All input paths, output paths, and script paths must use Cloud Storage URIs (gs://) or BigQuery table references.
- Forgetting to delete clusters after jobs complete. An idle cluster costs the same as a busy cluster. A four-node cluster running overnight with no jobs still accumulates a full night of VM charges. Script cluster deletion as the final step of every batch workflow, or use Dataproc Serverless to avoid this entirely.
- Missing Python dependencies at runtime. If your PySpark job imports packages not available on Dataproc by default, the job fails as soon as it reaches the import. For managed clusters, install packages at cluster creation using the pip-install initialization action together with --metadata=PIP_PACKAGES="package1 package2" (space-separated). For Serverless, use a custom container image.
- Region mismatch between cluster and data. Your Dataproc cluster (or Serverless job) and your Cloud Storage bucket should be in the same region. Cross-region data access adds latency and network egress charges. Likewise, the temporaryGcsBucket for BigQuery writes must be in the same region as the BigQuery dataset.
- Running out of memory on large jobs. Out-of-memory errors are among the most common Spark job failures. Adjust spark.executor.memory and spark.driver.memory using the --properties flag when submitting jobs. Serverless has no machine types to choose, so raise the same Spark properties (and spark.executor.cores if needed) with --properties on the batch submission.
- Submitting jobs before enabling the Dataproc API. The gcloud dataproc commands fail with a permission error if the Dataproc API is not enabled on your project. Run gcloud services enable dataproc.googleapis.com first. See enabling APIs for details.
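For the out-of-memory item above, the memory settings travel as a single comma-separated --properties flag. A minimal sketch of assembling that flag from a dictionary follows; the helper name properties_flag is hypothetical, and the values are the ones used earlier in this guide rather than tuning recommendations.

```python
from typing import Dict


def properties_flag(properties: Dict[str, str]) -> str:
    """Format Spark properties as one --properties=key=value,... flag."""
    for key, value in properties.items():
        # Commas separate entries and "=" separates key from value, so a
        # stray one inside a key or value would corrupt the flag.
        if "," in key or "=" in key or "," in value:
            raise ValueError(f"unsupported character in {key!r}={value!r}")
    pairs = ",".join(f"{key}={value}" for key, value in properties.items())
    return f"--properties={pairs}"


flag = properties_flag({
    "spark.executor.memory": "8g",
    "spark.driver.memory": "4g",
})
print(flag)  # → --properties=spark.executor.memory=8g,spark.driver.memory=4g
```

Keeping the properties in a dictionary makes it easy to version them next to the job script and adjust one value at a time while diagnosing memory failures.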
Summary#
- Two ways to run Spark on GCP: Dataproc managed clusters (you manage the cluster) and Dataproc Serverless (GCP manages everything).
- Use Serverless for infrequent batch jobs where you want zero operational overhead.
- Use managed clusters for frequent jobs, interactive notebooks, or workloads that need custom Spark configuration.
- All file paths must use gs:// Cloud Storage URIs. Local paths do not exist on worker VMs.
- Delete clusters immediately after jobs complete to avoid idle charges.
- The BigQuery Spark connector is pre-installed on Dataproc and enables direct reads and writes between Spark DataFrames and BigQuery tables.
- For new pipelines without existing Spark code, consider Dataflow as a fully serverless alternative.
Frequently asked questions
What is the difference between Dataproc and Dataproc Serverless?
Dataproc gives you a managed Spark cluster that you create, size, and delete yourself. Dataproc Serverless removes the cluster entirely: you submit a job, GCP provisions compute automatically, runs the job, and releases the resources when it finishes. Same Spark code, different operational models.
Do I need to create a cluster to run PySpark on GCP?
No. Dataproc Serverless lets you submit PySpark jobs without creating a cluster. Use gcloud dataproc batches submit pyspark with your script path and GCP handles provisioning. A managed cluster is only needed when you want persistent infrastructure for frequent jobs or interactive work.
When should I use Dataflow instead of Spark on GCP?
Use Dataflow when you are building a new pipeline from scratch and want a fully serverless experience with no cluster lifecycle to manage. Use Spark on Dataproc when you have existing Spark or PySpark code, need Spark-specific APIs like MLlib or GraphX, or prefer the Spark DataFrame API over Apache Beam.
Can Spark read from Cloud Storage and BigQuery?
Yes. Spark on Dataproc reads Cloud Storage paths directly using gs:// URIs. For BigQuery, Dataproc includes the BigQuery Spark connector, which lets you read and write BigQuery tables as Spark DataFrames using the BigQuery Storage API for high-throughput parallel access.
How do I handle Python dependencies for Dataproc jobs?
For managed clusters, install packages at cluster creation using the pip-install initialization action with the --metadata=PIP_PACKAGES key, or your own initialization action script. For Dataproc Serverless, use a custom container image, or ship your own Python modules with --py-files; the --deps-bucket flag only names the bucket where local dependency files are staged. Either way, declare dependencies before the job runs.