What Is Google Cloud Dataproc? Spark and Hadoop on GCP

Google Cloud Dataproc is a managed service for running Apache Spark and Hadoop workloads on GCP. It is built for teams that already have Spark or Hadoop code and want to run it in the cloud without rewriting pipelines. Dataproc handles cluster provisioning and configuration, but you still choose the cluster size and control the lifecycle. That makes it different from Dataflow, which is fully serverless and requires no cluster management at all. If you are starting a new pipeline from scratch, Dataflow with Apache Beam is usually the simpler choice. If you have existing Spark code or need Spark-specific features like MLlib or GraphX, Dataproc is the right service.

Dataproc in simple terms

Dataproc gives you a Spark or Hadoop cluster on GCP that is ready to use in about 90 seconds. You tell GCP how many machines you want, what size they should be, and what software to install. GCP creates the cluster, configures everything, and hands you the controls. You submit your Spark or Hadoop jobs, get the results, and delete the cluster when you are done.

The “managed” part means GCP handles the installation, configuration, patching, and teardown of the cluster infrastructure. The part that is not managed is the decision-making: how many nodes, what machine types, when to start, and when to stop. That operational responsibility stays with you.

Think of it this way

Dataproc is like renting a fully equipped workshop. The tools are installed, the power is connected, and the space is cleaned up when you leave. But you still decide which tools to use, how many workbenches you need, and when to open and close the shop. Dataflow, by contrast, is like hiring a contractor: describe the job you want done and they handle the rest.

What Dataproc is

Dataproc is managed Spark and Hadoop infrastructure. It is not a new processing engine or a proprietary framework. Your existing PySpark scripts, Scala Spark jobs, Hive queries, and MapReduce programs run on Dataproc without code changes. The cluster runs the same open-source software you would install yourself, but GCP handles the setup and teardown.

What “managed” covers:

  • Automated provisioning of master and worker VMs
  • Pre-installed Spark, Hadoop, Hive, Pig, and related components
  • Integration with Cloud Storage, BigQuery, and other GCP services
  • Cluster deletion that cleanly removes all VMs and local storage

What “managed” does not cover:

  • Choosing the right cluster size for your workload
  • Deciding when to create and delete clusters
  • Tuning Spark configuration for your jobs
  • Monitoring job performance and scaling accordingly

If you want a service that handles all of those decisions for you, look at Dataflow or Dataproc Serverless instead.

How Dataproc works

A typical Dataproc workflow follows these steps:

  1. Store your data in Cloud Storage. Upload input files to a Cloud Storage bucket using gs:// paths. Do not rely on HDFS for persistent data. HDFS is local to the cluster and deleted with it.
  2. Create a cluster or submit a serverless batch. For cluster mode, run gcloud dataproc clusters create and specify the region, machine types, and worker count. For Dataproc Serverless, skip this step entirely.
  3. Submit your Spark or Hadoop job. Use gcloud dataproc jobs submit pyspark (cluster mode) or gcloud dataproc batches submit pyspark (serverless). Reference your script and data by their gs:// paths.
  4. The job runs. Dataproc distributes the work across workers. Spark reads from Cloud Storage, processes the data, and writes output.
  5. Write outputs to Cloud Storage or BigQuery. Results persist independently of the cluster. Spark can write directly to BigQuery using the BigQuery Spark connector, which is pre-installed on recent Dataproc image versions. This is useful when your processed data feeds into analytical queries downstream.
  6. Delete the cluster (cluster mode only). Run gcloud dataproc clusters delete to release all compute resources and stop billing. With Dataproc Serverless, cleanup happens automatically.

Tip

Script the entire create-run-delete cycle as a single workflow. A Cloud Composer DAG or a simple shell script that creates the cluster, submits the job, waits for completion, and deletes the cluster ensures you never leave a cluster running by accident.
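
The create-run-delete cycle described above can be sketched in Python as follows. This is a minimal illustration, not a production orchestrator: the function name run_ephemeral_job, the injectable runner parameter, and the default worker count are assumptions made for this example. The gcloud subcommands are the ones named in the workflow steps, and the sketch assumes gcloud is installed and authenticated.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], None]


def run_checked(cmd: Sequence[str]) -> None:
    """Run a command and raise CalledProcessError if it exits non-zero."""
    subprocess.run(list(cmd), check=True)


def run_ephemeral_job(
    cluster: str,
    region: str,
    script_uri: str,
    num_workers: int = 2,
    runner: Runner = run_checked,
) -> None:
    """Create a cluster, run one PySpark job, then always delete the cluster."""
    runner([
        "gcloud", "dataproc", "clusters", "create", cluster,
        f"--region={region}", f"--num-workers={num_workers}",
    ])
    try:
        # Without --async, the submit command waits for the job to finish.
        runner([
            "gcloud", "dataproc", "jobs", "submit", "pyspark", script_uri,
            f"--cluster={cluster}", f"--region={region}",
        ])
    finally:
        # Runs even when the job fails, so the cluster is not left behind.
        runner([
            "gcloud", "dataproc", "clusters", "delete", cluster,
            f"--region={region}", "--quiet",
        ])
```

The try/finally block is the important design choice: the delete command runs whether the job succeeds or fails. If cluster creation itself fails, no delete is attempted, so a real script would likely need extra handling for that case.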

When to use Dataproc

  • Existing Spark or Hadoop workloads. You have PySpark, Scala Spark, Hive, or MapReduce code that works and you want to run it on GCP without rewriting it.
  • Spark-specific features. Your workload uses MLlib for machine learning, GraphX for graph processing, or parts of the Spark Structured Streaming API that have no direct equivalent in Apache Beam.
  • Lift-and-shift migration. You are moving an on-premises Hadoop or Spark cluster to GCP and want minimal disruption to existing pipelines.
  • Interactive exploration. You need Jupyter notebooks running on a Spark cluster for exploratory data analysis before packaging jobs for production.
  • Large-scale batch processing. Your ETL or ELT pipeline processes terabytes of data using Spark’s distributed engine and writes results to a data lake or data warehouse.

When not to use Dataproc

  • New pipelines from scratch. If you are writing a new data pipeline and have no existing Spark code, Dataflow with Apache Beam is simpler. No cluster to manage, no sizing decisions, no lifecycle to automate.
  • Simple Cloud Storage to BigQuery ETL. For straightforward file transformations and loads, a Dataflow template or BigQuery load job is less operational overhead than standing up a Spark cluster.
  • Real-time streaming with autoscaling. Dataflow autoscales workers automatically based on message backlog. Resizing a Dataproc cluster is slower and requires either a configured autoscaling policy or manual or scripted intervention. For streaming pipelines with variable throughput, Dataflow is the better fit.
  • SQL-only analytics. If your workload is SQL queries on structured data, BigQuery is serverless, requires no cluster, and scales automatically. You do not need Spark for SQL analytics.
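
The guidance in the two lists above can be condensed into a small decision helper. This is a deliberately simplified sketch: the Workload fields and the suggest_service function are hypothetical names for this example, and the order of the checks is an assumption about how to rank the criteria, not an official rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Traits of a pipeline, reduced to the criteria named in this article."""
    has_existing_spark_or_hadoop_code: bool = False
    needs_spark_specific_features: bool = False  # e.g. MLlib or GraphX
    sql_only: bool = False


def suggest_service(workload: Workload) -> str:
    """Return a first-guess service name for the given workload."""
    if (workload.has_existing_spark_or_hadoop_code
            or workload.needs_spark_specific_features):
        return "Dataproc"
    if workload.sql_only:
        return "BigQuery"
    # New pipelines with no Spark dependency default to Dataflow.
    return "Dataflow"
```

For example, suggest_service(Workload(sql_only=True)) suggests BigQuery, while a workload with existing Spark code is pointed at Dataproc regardless of the other fields.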

Dataproc vs Dataflow vs Dataproc Serverless

These three services cover different points on the managed-to-serverless spectrum. The right choice depends on your existing code, your team’s expertise, and how much cluster management you want to handle.

|  | Dataproc (cluster mode) | Dataproc Serverless | Dataflow |
| --- | --- | --- | --- |
| Management model | You create, size, and delete clusters | No cluster to manage; submit jobs directly | Fully serverless; no cluster concept |
| Best use case | Existing Spark/Hadoop code, Spark-specific features | Infrequent Spark batch jobs without cluster overhead | New pipelines, streaming, Apache Beam workloads |
| Code model | PySpark, Scala Spark, Hive, MapReduce | PySpark, Spark SQL | Apache Beam (Python, Java, Go) |
| Pricing shape | Per-VM billing while cluster exists | Per vCPU-second and memory-second during job | Per vCPU-second, memory, and disk while workers run |
| Operational overhead | Highest: sizing, lifecycle, tuning | Low: submit and forget | Lowest: fully managed autoscaling |

For a deeper look at the Spark side, see Running Spark in GCP. For the Beam and Dataflow side, see Dataflow Overview.

Dataproc architecture and key components

Cluster mode

A Dataproc cluster consists of a master node and one or more worker nodes, all running as Compute Engine VMs. The master coordinates job scheduling and resource management. Workers execute the distributed processing tasks.

  • Primary workers are standard VMs that persist for the life of the cluster.
  • Secondary workers (optional) can be Spot VMs for significant cost savings on batch workloads that tolerate interruptions.
  • Single-node clusters run everything on one VM. Useful for development and testing, not production workloads.
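
As a rough illustration of how these cluster shapes map to the create command, the sketch below builds a gcloud argument list for each case. The function name create_cluster_command and its parameters are hypothetical. The flags are those of gcloud dataproc clusters create, and it is worth checking them against the current gcloud reference before use.

```python
from typing import List


def create_cluster_command(
    name: str,
    region: str,
    primary_workers: int = 2,
    spot_secondary_workers: int = 0,
    single_node: bool = False,
) -> List[str]:
    """Build (but do not run) a cluster create command for a given shape."""
    cmd = ["gcloud", "dataproc", "clusters", "create", name, f"--region={region}"]
    if single_node:
        if spot_secondary_workers:
            raise ValueError("a single-node cluster cannot have secondary workers")
        # Everything runs on one VM: development and testing only.
        cmd.append("--single-node")
        return cmd
    cmd.append(f"--num-workers={primary_workers}")
    if spot_secondary_workers > 0:
        # Secondary workers are optional; here they are requested as Spot VMs.
        cmd.append(f"--num-secondary-workers={spot_secondary_workers}")
        cmd.append("--secondary-worker-type=spot")
    return cmd
```

Returning the argument list instead of running it keeps the example side-effect free. The list can be passed to subprocess.run once the shape looks right.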

Cloud Storage vs HDFS

Dataproc clusters include HDFS by default, but you should use Cloud Storage (gs://) for all persistent data. Cloud Storage survives cluster deletion, is accessible from any cluster or GCP service, and offers high durability. HDFS is local to the cluster: delete the cluster and the HDFS data is gone.

The only practical use for HDFS on Dataproc is temporary intermediate files within a single job execution. For everything else (input data, output data, scripts, configuration), use Cloud Storage.

Warning

HDFS data does not survive cluster deletion. If you write job output to HDFS on an ephemeral cluster and then delete that cluster, the output is permanently lost. Always use gs:// paths for any data you need to keep.
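
One way to guard against this mistake is to validate output paths before submitting a job. The sketch below is an assumption-laden example, not part of any Dataproc tooling: the helper names are invented here, and the check only recognises gs:// URIs, so BigQuery table references would need separate handling.

```python
from typing import Iterable, List
from urllib.parse import urlparse


def is_cloud_storage_uri(path: str) -> bool:
    """Return True only for URIs of the form gs://bucket/..."""
    parsed = urlparse(path)
    return parsed.scheme == "gs" and bool(parsed.netloc)


def paths_lost_on_deletion(output_paths: Iterable[str]) -> List[str]:
    """Return the output paths that would not survive cluster deletion."""
    return [p for p in output_paths if not is_cloud_storage_uri(p)]
```

Calling paths_lost_on_deletion on a job's output locations before submission returns any hdfs:// or local paths, which can then be rejected or rewritten to a bucket.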

Job types

Dataproc supports PySpark, Spark (Scala/Java), SparkR, Spark SQL, Hive, Pig, and Hadoop MapReduce jobs. Most teams use PySpark or Spark SQL. Serverless batches run Spark workloads only (PySpark, Spark Scala/Java, SparkR, and Spark SQL); Hive, Pig, and MapReduce jobs need a cluster.
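
For cluster mode, each job type corresponds to a subcommand of gcloud dataproc jobs submit. The mapping below is a small lookup sketch. The dictionary and function names are hypothetical, and the subcommand spellings should be confirmed against the gcloud reference for your SDK version.

```python
from typing import Dict, List

# Job type as named in this article -> `gcloud dataproc jobs submit` subcommand.
JOB_SUBCOMMANDS: Dict[str, str] = {
    "PySpark": "pyspark",
    "Spark (Scala/Java)": "spark",
    "SparkR": "spark-r",
    "Spark SQL": "spark-sql",
    "Hive": "hive",
    "Pig": "pig",
    "Hadoop MapReduce": "hadoop",
}


def submit_prefix(job_type: str) -> List[str]:
    """Return the leading arguments for submitting the given job type."""
    if job_type not in JOB_SUBCOMMANDS:
        raise ValueError(f"unsupported job type: {job_type}")
    return ["gcloud", "dataproc", "jobs", "submit", JOB_SUBCOMMANDS[job_type]]
```

The caller still has to append the job-specific arguments, such as the script URI, the cluster name, and the region.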

Cost and operational best practices

Use ephemeral clusters

The most impactful cost practice for Dataproc is ephemeral clusters. Create a cluster when a job needs to run, execute the job, and delete the cluster immediately. You pay only for the time the cluster exists. A daily batch job that runs for 2 hours costs 2 hours of cluster time, not 24.
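
The arithmetic behind that comparison is simple enough to write down. In the sketch below the hourly rate is a placeholder of 1.0 currency units per cluster-hour, chosen only to make the ratio visible. It is not a real GCP price, and actual cost depends on machine types and worker count.

```python
def daily_cluster_cost(hours_per_day: float, hourly_rate: float) -> float:
    """Cost of keeping a cluster alive for the given number of hours per day."""
    return hours_per_day * hourly_rate


HYPOTHETICAL_RATE = 1.0  # placeholder cost per cluster-hour, not a real price

ephemeral_cost = daily_cluster_cost(2, HYPOTHETICAL_RATE)    # cluster exists 2 h/day
persistent_cost = daily_cluster_cost(24, HYPOTHETICAL_RATE)  # cluster exists all day
saving_fraction = 1 - ephemeral_cost / persistent_cost       # 1 - 2/24
```

Because the rate cancels out of the ratio, the saving for this 2-hour daily job is about 92% of the always-on cost whatever the real hourly price is.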

Dataproc clusters typically provision in 60 to 90 seconds. This is fast enough for any scheduled batch workflow. Script the create-run-delete pattern as part of your pipeline orchestration.

Avoid idle clusters

A cluster with 8 worker nodes running overnight and on weekends accumulates compute cost whether or not any jobs are running. If you are not actively submitting jobs, delete the cluster. For workloads that run a few times a day, ephemeral clusters or Dataproc Serverless eliminate idle cost entirely.

Use Spot VMs as secondary workers

Spot VMs as secondary workers can significantly reduce cost for batch workloads. Spot VMs are preemptible, meaning GCP can reclaim them with short notice. Spark handles this gracefully by reassigning tasks from lost workers to surviving ones. For batch jobs that do not need guaranteed completion times, this trade-off is worth it.

Right-size your clusters

Start with a smaller cluster and monitor resource utilisation. Scale up only if jobs are CPU- or memory-constrained. Over-provisioning “just in case” is one of the most common sources of wasted spend on Dataproc. For broader cost strategies, see cost optimisation strategy.

Consider Dataproc Serverless for infrequent jobs

If a Spark job runs once or twice a day, the provisioning overhead of Dataproc Serverless is negligible compared to the cost of keeping a cluster running between jobs. For jobs that run many times per hour, a persistent cluster may be cheaper. Measure both options against your actual workload before committing.

Cost rule of thumb

If your cluster sits idle for more than half its running time, you are overpaying. Switch to ephemeral clusters or Dataproc Serverless. If your cluster runs jobs back-to-back with little idle time, a persistent cluster is likely more cost-effective than repeated provisioning.
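
This rule of thumb can be expressed as a small check on measured usage. The sketch below is illustrative only: the function names are invented, and the 10% cut-off used to stand for "little idle time" is an assumption, since the rule itself only fixes the 50% threshold.

```python
def idle_fraction(busy_hours: float, running_hours: float) -> float:
    """Fraction of the cluster's running time in which no job was executing."""
    if running_hours <= 0:
        raise ValueError("running_hours must be positive")
    if not 0 <= busy_hours <= running_hours:
        raise ValueError("busy_hours must be between 0 and running_hours")
    return 1 - busy_hours / running_hours


def recommend(busy_hours: float, running_hours: float,
              low_idle_threshold: float = 0.1) -> str:
    """Apply the rule of thumb; low_idle_threshold is an assumed cut-off."""
    idle = idle_fraction(busy_hours, running_hours)
    if idle > 0.5:
        return "ephemeral clusters or Dataproc Serverless"
    if idle <= low_idle_threshold:
        return "persistent cluster"
    return "measure both options"
```

A cluster that is busy for 2 of its 24 running hours falls in the first branch, while one that is busy for 23 of 24 hours falls in the second. Anything in between is left as a prompt to measure both options against the actual workload.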

Common beginner mistakes

  1. Keeping clusters running when no jobs are executing. A persistent cluster with several worker nodes running overnight and on weekends accumulates significant cost with zero job activity. Delete clusters when jobs complete. Cluster creation takes 60 to 90 seconds, fast enough for any scheduled batch workflow.
  2. Storing job output in HDFS on an ephemeral cluster. If you store output in HDFS and then delete the cluster, the data is gone. Always write job output to Cloud Storage (gs://). Use HDFS only for temporary intermediate files within a single job.
  3. Over-provisioning cluster size upfront. Starting with a large cluster because “more is faster” is expensive. Start smaller, monitor execution and resource utilisation, and scale up only if jobs are CPU- or memory-constrained.
  4. Using Dataproc for simple ETL when Dataflow would be simpler. If your pipeline reads files from Cloud Storage, applies straightforward transformations, and writes to BigQuery, a Dataflow template requires no cluster management. Reach for Dataproc when Spark-specific capabilities genuinely justify the operational overhead.
  5. Using local file paths in Spark scripts. A path like /home/user/data.csv on your own machine does not exist on Dataproc worker VMs. Input and output paths should be Cloud Storage paths (gs://) or BigQuery table references.

Frequently asked questions

When should I use Dataproc instead of Dataflow?

Use Dataproc when you have existing Spark or Hadoop code you want to run on GCP without rewriting it. Use Dataflow for new pipelines written from scratch using Apache Beam, where you want fully serverless execution with no cluster to manage.

What is Dataproc Serverless for Spark?

Dataproc Serverless lets you submit Spark batch jobs, such as PySpark or Spark SQL, without creating or managing a cluster. GCP provisions compute automatically, runs the job, and releases the resources when it finishes. You pay only for job execution time, billed per vCPU-second and memory-second.

Should I use HDFS or Cloud Storage with Dataproc?

Use Cloud Storage for all job input and output data. Cloud Storage persists after the cluster is deleted, while HDFS data is lost when an ephemeral cluster is removed. Use HDFS only for temporary intermediate files within a single job execution.

How long does it take to create a Dataproc cluster?

Dataproc clusters typically provision in 60 to 90 seconds. This is fast enough to create a cluster at the start of a scheduled batch job and delete it when the job completes, eliminating idle cost.

Is Dataproc fully serverless?

Standard Dataproc is not fully serverless. You choose the cluster size, machine types, and manage the cluster lifecycle. Dataproc Serverless for Spark removes cluster management for batch Spark jobs but does not cover all workload types. Dataflow is the fully serverless option for Apache Beam pipelines.

Last verified: 26 March 2026. Cloud services change frequently. Verify details against official documentation before making infrastructure decisions.