What Is Google Cloud Dataflow? Apache Beam, Use Cases, Pricing
Google Cloud Dataflow is a fully managed service for running data processing pipelines. You write pipeline logic using the Apache Beam SDK in Python, Java, or Go. Dataflow handles worker provisioning, parallel execution, autoscaling, and failure recovery. You never create or manage a cluster.
Dataflow is the right choice when you need to run batch ETL jobs, build streaming pipelines, or apply complex transformations to large datasets on GCP. It is not the right choice for low-volume event handling where a simpler service like Cloud Run would cost less and be easier to operate.
Simple explanation
Dataflow has two parts: the programming model and the execution engine.
The programming model is Apache Beam. Beam is an open-source SDK that lets you describe a series of data transformations: read from a source, apply logic to each record, write to a destination. You write this as ordinary Python or Java code.
The execution engine is Dataflow itself. When you submit a Beam pipeline to Dataflow, Google’s infrastructure compiles your pipeline graph, spins up worker VMs, distributes data across them, and tears everything down when the job finishes. Streaming jobs stay running until you stop them.
Beam and Dataflow are separate things. Beam is the recipe. Dataflow is the kitchen that follows the recipe at scale. You can test a Beam pipeline on your laptop using the DirectRunner and then submit the same pipeline code to Dataflow for production, changing only the runner and job options rather than the pipeline logic.
Writing a Beam pipeline and submitting it to Dataflow is like writing a recipe and handing it to a catering company. You describe what to cook. They handle the kitchen, equipment, staffing, and clean-up. You get the finished output.
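A minimal sketch of that split between recipe and kitchen, assuming the Beam Python SDK is installed (pip install apache-beam). The sample data and the names parse_line and run are hypothetical and exist only for illustration; a real pipeline would read from and write to real sources and sinks.

```python
# Hypothetical sample input: "user,amount" lines standing in for a real source.
SAMPLE_LINES = ["alice,3.50", "bob,1.25", "alice,2.00"]


def parse_line(line):
    """Turn a 'user,amount' line into a (user, amount) pair."""
    user, amount = line.split(",")
    return user.strip(), float(amount)


def run(argv=None):
    # Beam is imported here so parse_line can be unit-tested without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # The runner is taken from the options, not from the pipeline logic.
    # With argv=None, Beam falls back to the process command line.
    options = PipelineOptions(argv)
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "Read" >> beam.Create(SAMPLE_LINES)        # read from a source
            | "Parse" >> beam.Map(parse_line)            # apply logic to each record
            | "SumPerUser" >> beam.CombinePerKey(sum)    # aggregate per key
            | "Write" >> beam.Map(print)                 # stand-in for a real sink
        )
```

Calling run(["--runner=DirectRunner"]) executes the pipeline locally. Submitting to Dataflow would use --runner=DataflowRunner together with --project, --region, and --temp_location values for your own environment; the pipeline body stays the same.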
How Dataflow works
When you submit a pipeline, Dataflow goes through these steps:
- Graph compilation. Dataflow reads your Beam pipeline code and builds an optimised execution graph. It may fuse adjacent steps together to reduce data shuffling between workers.
- Worker provisioning. Dataflow launches worker VMs in your GCP project and region. You do not have to choose the initial number of workers: by default, Dataflow decides based on the input size and pipeline shape.
- Parallel execution. Workers pull tasks from a coordination service, process data partitions in parallel, and write results to the output sinks you defined (for example, BigQuery tables or Cloud Storage buckets).
- Autoscaling. Dataflow monitors throughput and backlog. For batch jobs, it scales workers based on remaining work. For streaming jobs, it scales based on message backlog. You can set `--max_num_workers` to cap the upper limit.
- Failure handling. If a worker fails, Dataflow retries the affected work on another worker. You do not need to write retry logic for infrastructure failures.
- Completion. Batch jobs finish when all data is processed and workers are shut down. Streaming jobs run continuously until you drain or cancel them.
Batch and streaming use the same Beam SDK. Batch pipelines process bounded datasets with a defined start and end. Streaming pipelines process unbounded data, typically messages arriving through Pub/Sub, and run until stopped. The mode follows from your sources and job options (for example, the `--streaming` flag in the Python SDK), so the transformation code itself stays largely the same.
Think of autoscaling like a conveyor belt at a shipping warehouse. When packages pile up, more workers are pulled from other stations to clear the backlog. When the rush passes, those workers go back. You set a maximum headcount, but you never assign people to stations yourself.
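The submission settings described in this section can be sketched as a small helper that assembles job flags. The flag names are those of the Beam Python SDK; the function name dataflow_args and all project, region, and bucket values are placeholders, not real resources.

```python
def dataflow_args(project, region, temp_location, max_num_workers, streaming=False):
    """Build Beam Python SDK pipeline flags for a Dataflow job."""
    if max_num_workers < 1:
        raise ValueError("max_num_workers must be at least 1")
    args = [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        f"--temp_location={temp_location}",
        # Upper bound for autoscaling; Dataflow picks the actual count below it.
        f"--max_num_workers={max_num_workers}",
    ]
    if streaming:
        # Marks the job as a streaming (unbounded) job.
        args.append("--streaming")
    return args


# Placeholder values for illustration only.
batch_args = dataflow_args("my-project", "europe-west2", "gs://my-bucket/tmp", 10)
streaming_args = dataflow_args(
    "my-project", "europe-west2", "gs://my-bucket/tmp", 4, streaming=True
)
```

A list like this can be passed to PipelineOptions when constructing the pipeline. Keeping the cap as a required argument makes it harder to submit a job without one.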
When to use Dataflow
Batch ETL. Read files from Cloud Storage, apply transformations (parse, clean, enrich, join), and write results to BigQuery or another destination. This is the most common Dataflow use case. If you are building an ETL or ELT pipeline on GCP, Dataflow is the standard tool for the transform step.
Streaming ingestion. Consume messages from Pub/Sub, transform them in flight, and write to BigQuery continuously. This gives you near-real-time analytics without batch delays. The Pub/Sub, Dataflow, BigQuery pattern is the default streaming pipeline architecture on GCP.
Large-scale data transformation. Apply complex business logic to large datasets: joining multiple sources, computing aggregations, reshaping nested data, or applying ML model predictions. Dataflow distributes the work across workers automatically.
Reusable pipeline deployment. Package pipelines as Flex Templates so non-engineers can run parameterised data jobs from the console or API. Useful when the same pipeline logic needs to run on different datasets or schedules.
Ask two questions: (1) Does my data need transformation before it reaches its destination? (2) Is the volume high enough that a single machine cannot handle it? If both answers are yes, Dataflow is a strong fit.
When not to use Dataflow
Dataflow adds overhead (startup time, worker provisioning, per-second billing) that is not justified for every workload. Consider alternatives in these situations:
- Low-throughput event handling. If you are processing a few hundred events per minute with straightforward logic (parse, validate, write), a Pub/Sub push subscription to Cloud Run is simpler and cheaper. Dataflow is worth its overhead when you need windowing, aggregation, or high throughput.
- Simple loads into BigQuery. If your data does not need transformation, a direct batch load from Cloud Storage or the BigQuery Data Transfer Service avoids the complexity of a pipeline. See loading data into BigQuery for options.
- Existing Spark or Hadoop workloads. If you already have PySpark or Hadoop MapReduce jobs, Dataproc lets you run them on GCP with minimal code changes. Rewriting to Beam only makes sense if you also want the serverless operational model.
- Service-to-service processing. For event-driven patterns where one service emits an event and another reacts to it, Cloud Run or Cloud Functions are a better fit than Dataflow.
A common trap is using Dataflow for everything just because it works. A streaming Dataflow job that runs 24/7 to process 50 events per minute will cost far more than a Cloud Run service doing the same work. Match the tool to the throughput and complexity of the workload.
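As a rough illustration, the guidance in the last two sections can be written down as a small decision helper. This is a sketch, not an official sizing rule: the function name, the order of the checks, and the 500 events per minute cut-off are assumptions chosen to match "a few hundred events per minute".

```python
def suggest_service(
    events_per_minute,
    needs_transformation,
    needs_windowing_or_aggregation,
    has_spark_or_hadoop_code,
    low_volume_threshold=500,  # hypothetical cut-off, tune for your workload
):
    """Return a first-guess GCP service for a data workload."""
    if has_spark_or_hadoop_code:
        # Existing Spark or Hadoop jobs run on Dataproc with minimal changes.
        return "Dataproc"
    if not needs_transformation:
        # Nothing to transform: load directly instead of building a pipeline.
        return "BigQuery load or Data Transfer Service"
    if events_per_minute <= low_volume_threshold and not needs_windowing_or_aggregation:
        # Low volume with simple per-event logic.
        return "Pub/Sub push to Cloud Run"
    return "Dataflow"


# The 50 events per minute case from the text, with simple per-event logic.
small_stream = suggest_service(50, True, False, False)
```

Treat the result as a starting point for discussion; real decisions also depend on latency needs, team skills, and cost.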
Common use cases
Cloud Storage to BigQuery ETL. Read CSV, JSON, Avro, or Parquet files from Cloud Storage, clean and transform the data, and write to BigQuery tables. Pre-built Dataflow templates handle this pattern with zero custom code.
Pub/Sub to BigQuery streaming. Consume messages from a Pub/Sub topic as they arrive, parse and enrich each message, and write to BigQuery continuously. This gives you a near-real-time analytics pipeline.
Data enrichment and joining. Read a primary dataset, join it with reference data from another source (a BigQuery table, a Cloud Storage file, or an external API), and output the enriched result.
Data migration. Move and transform data between systems with exactly-once semantics and built-in progress tracking. Dataflow handles checkpointing so you do not have to build custom resume logic.
Log processing. Parse structured and semi-structured logs at scale, extract fields, apply filters, and route results to BigQuery for analysis or Cloud Storage for archival.
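The Pub/Sub to BigQuery pattern above might look like the following in the Beam Python SDK. This is a sketch under assumptions: it needs the GCP extras (pip install "apache-beam[gcp]"), the message format (JSON with user and amount fields) and the schema are invented for the example, and the topic and table names are supplied by the caller.

```python
import json


def parse_message(data):
    """Decode a Pub/Sub payload (bytes) into a BigQuery row dict."""
    record = json.loads(data.decode("utf-8"))
    return {"user": str(record["user"]), "amount": float(record["amount"])}


def run(topic, table, argv=None):
    # Imported here so parse_message can be tested without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

    options = PipelineOptions(argv)
    options.view_as(StandardOptions).streaming = True  # Pub/Sub is an unbounded source

    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            # topic looks like "projects/<project>/topics/<topic>"
            | "ReadPubSub" >> beam.io.ReadFromPubSub(topic=topic)
            | "Parse" >> beam.Map(parse_message)
            # table looks like "<project>:<dataset>.<table>"
            | "WriteBigQuery" >> beam.io.WriteToBigQuery(
                table,
                schema="user:STRING,amount:FLOAT",  # hypothetical schema
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )
```

A production version would usually also handle malformed messages, for example by routing them to a separate output instead of letting the parse step fail.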
Dataflow templates
Google provides a library of pre-built Dataflow templates for common pipeline patterns. You run a template from the console, CLI, or API by providing configuration parameters. No Beam code required.
Frequently used templates include Cloud Storage Text to BigQuery, Pub/Sub to BigQuery, Cloud Storage Avro to BigQuery, BigQuery to Cloud Storage, and Pub/Sub to Cloud Storage. These cover the majority of ETL and streaming ingestion patterns.
Flex Templates let you package your own custom Beam pipelines as container images and reuse them with the same interface as Google’s built-in templates. Teams use Flex Templates to standardise pipeline deployment across projects and let non-engineers run parameterised data jobs.
Before writing a custom Beam pipeline, check the Dataflow template library. Many common patterns are already implemented and tested. Starting with a template saves development time and avoids common pitfalls like watermark misconfiguration in streaming pipelines.
Dataflow templates are like meal kits. The recipe and pre-measured ingredients come in the box. You just provide the delivery address (your BigQuery table) and any preferences (column mappings, filters). Flex Templates let you design your own meal kit so your team can reuse it without starting from scratch each time.
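Running a pre-built template from the CLI comes down to a job name, a template location, and a set of parameters. The helper below builds a gcloud dataflow jobs run command as an argument list. The template path and the parameter names inputTopic and outputTableSpec are given as an example for the Pub/Sub to BigQuery template and should be checked against the current template documentation; all resource names are placeholders.

```python
def template_run_command(job_name, region, template_path, parameters):
    """Build a `gcloud dataflow jobs run` command as an argv list."""
    joined = ",".join(f"{key}={value}" for key, value in parameters.items())
    return [
        "gcloud", "dataflow", "jobs", "run", job_name,
        f"--gcs-location={template_path}",
        f"--region={region}",
        f"--parameters={joined}",
    ]


# Placeholder values; verify the template path and parameter names before use.
command = template_run_command(
    job_name="pubsub-to-bq-example",
    region="europe-west2",
    template_path="gs://dataflow-templates/latest/PubSub_to_BigQuery",
    parameters={
        "inputTopic": "projects/my-project/topics/my-topic",
        "outputTableSpec": "my-project:my_dataset.my_table",
    },
)
```

With the gcloud CLI installed and authenticated, the list can be passed to subprocess.run(command, check=True). Flex Templates are launched with a different subcommand (gcloud dataflow flex-template run), so this helper covers only the classic form.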
Dataflow vs Dataproc
Both services process large datasets, but the operational model and programming model are different. The right choice depends on what code you already have and how much infrastructure you want to manage.
| | Dataflow | Dataproc |
|---|---|---|
| Programming model | Apache Beam (Python, Java, Go) | Apache Spark, Hadoop, Hive, Presto |
| Operational burden | Serverless, no cluster to create or manage | Managed clusters, you choose size and lifecycle |
| Scaling | Automatic autoscaling per job | Manual sizing or autoscaling policy you configure |
| Best use case | New ETL/streaming pipelines with no existing Spark code | Existing Spark or Hadoop workloads migrating to GCP |
| Poor fit | Teams with large existing Spark codebases | Teams that want zero cluster management |
Choose Dataflow for new pipelines written from scratch. Choose Dataproc when you have existing Spark or Hadoop code and want to run it on GCP with minimal changes. See running Spark in GCP for a hands-on comparison.
This is not an either/or decision for an entire organisation. Many teams run Dataflow for new streaming pipelines while keeping their existing Spark batch jobs on Dataproc. Choose per workload, not per team.
Pricing and cost drivers
Dataflow pricing has two components: the compute resources your workers use and a Dataflow service fee.
- Worker resources. You pay per second for the vCPUs, memory, and persistent disk (or SSD) used by worker VMs while a job is running. More workers or larger machine types increase cost proportionally.
- Dataflow service fee. A per-vCPU-hour surcharge on top of the underlying compute cost. This is the premium for the managed, serverless experience.
- Batch vs streaming cost behavior. Batch jobs spin up workers, process data, and shut down. You pay only for the duration of the job. Streaming jobs run continuously, so workers accumulate cost around the clock. A streaming job that runs 24/7 with 4 workers costs significantly more per month than a batch job that runs for 30 minutes daily.
- Data egress. Moving data between regions incurs network egress charges. Running Dataflow workers in the same region as your data sources and sinks avoids this.
Streaming jobs are the most common source of unexpected Dataflow bills. Because they run continuously, even a small pipeline with a few workers can accumulate significant monthly cost. Always set `--max_num_workers` and monitor job metrics in the Dataflow console to catch inefficient pipelines early.
To control costs: right-size your `--max_num_workers` limit, keep workers in the same region as your data, use batch mode when real-time results are not required, and review job metrics regularly. For the destination side, see BigQuery pricing to understand the downstream cost of writing large volumes of data.
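To put numbers on the batch versus streaming comparison, the sketch below counts worker-hours per month. It assumes a 30-day month, a fixed worker count with no autoscaling, and the same machine type for both jobs; the batch job is assumed to use 4 workers as well, which the comparison above does not specify. No real prices are used, since rates vary by region and machine type.

```python
DAYS_PER_MONTH = 30  # simplifying assumption


def monthly_worker_hours(workers, hours_per_day, days=DAYS_PER_MONTH):
    """Worker-hours consumed per month at a fixed worker count."""
    return workers * hours_per_day * days


def monthly_cost(workers, hours_per_day, price_per_worker_hour):
    """Monthly cost given a caller-supplied price per worker-hour."""
    return monthly_worker_hours(workers, hours_per_day) * price_per_worker_hour


# Streaming: 4 workers running around the clock.
streaming_hours = monthly_worker_hours(workers=4, hours_per_day=24)  # 2880
# Batch: 4 workers for 30 minutes a day.
batch_hours = monthly_worker_hours(workers=4, hours_per_day=0.5)     # 60.0
# Ratio of streaming to batch worker-hours.
ratio = streaming_hours / batch_hours                                # 48.0
```

Under these assumptions the streaming job uses 48 times the worker-hours of the batch job, and the same ratio carries over to cost at any given per-worker-hour price.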
Common beginner mistakes
- Running workers in a different region from your data. Dataflow workers should run in the same region as your Cloud Storage buckets and BigQuery datasets. Cross-region transfer adds latency and egress cost. Always specify `--region` when submitting a job and match it to your data location.
- Skipping autoscaling limits on streaming jobs. Without a `--max_num_workers` ceiling, a traffic spike can cause Dataflow to scale to far more workers than expected, producing a large surprise bill. Set a maximum that matches your expected peak throughput.
- Testing only in production. The Beam DirectRunner executes pipelines locally with a small data sample. Use it during development to catch logic errors in seconds without incurring cloud costs or waiting for Dataflow job startup.
- Using Dataflow when a simpler service would work. For low-volume events with basic processing, a Pub/Sub push subscription to Cloud Run costs less and deploys faster. Reserve Dataflow for workloads that need windowing, aggregation, joins, or high throughput.
- Writing a custom pipeline before checking templates. The Dataflow template library covers many standard patterns. Check it first. Using a tested template avoids bugs and reduces time to production.
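Two of these mistakes, region mismatch and a missing worker cap, can be caught with a check before submission. The helper below is a hypothetical sketch: the function name and messages are invented, and the plain string comparison of regions is a simplification that does not understand multi-region locations such as a BigQuery dataset in US or EU.

```python
def preflight_warnings(job_region, data_region, streaming, max_num_workers=None):
    """Return a list of warnings for common Dataflow job configuration mistakes."""
    warnings = []
    if job_region.lower() != data_region.lower():
        warnings.append(
            f"Job region {job_region} differs from data region {data_region}: "
            "expect extra latency and egress cost."
        )
    if streaming and max_num_workers is None:
        warnings.append(
            "Streaming job has no --max_num_workers cap: "
            "a traffic spike could scale it further than intended."
        )
    return warnings


# Placeholder regions for illustration.
ok = preflight_warnings("europe-west2", "europe-west2", streaming=True, max_num_workers=4)
risky = preflight_warnings("us-central1", "europe-west2", streaming=True)
```

Here ok is an empty list and risky contains two warnings. A check like this fits naturally into a deployment script or CI step that runs before the job is submitted.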
Summary
- Dataflow is GCP’s managed service for running Apache Beam pipelines. You write Beam code; Dataflow provisions workers, handles failures, and autoscales.
- Supports both batch (bounded) and streaming (unbounded) pipelines using the same Beam SDK.
- Common patterns: Cloud Storage to BigQuery ETL, Pub/Sub to BigQuery streaming, data enrichment, and log processing.
- Use Dataflow templates for standard patterns before writing custom pipelines.
- Choose Dataflow for new pipelines. Choose Dataproc for existing Spark or Hadoop code.
- Streaming jobs run continuously and cost more than equivalent batch jobs. Set `--max_num_workers` and keep workers in the same region as your data.
Frequently asked questions
What is the difference between Dataflow and Apache Beam?
Apache Beam is the open-source programming model and SDK you use to write pipeline code. Dataflow is the Google Cloud managed service that executes Beam pipelines. You write Beam code locally, test it with the DirectRunner, then submit it to Dataflow for production. Beam pipelines can also run on other runners like Flink or Spark, but Dataflow is the standard runner on GCP.
Does Dataflow handle both batch and streaming?
Yes. The same Apache Beam SDK covers both. Batch pipelines process bounded data with a defined start and end. Streaming pipelines process unbounded data continuously. Dataflow manages infrastructure for both modes and autoscales workers based on workload.
How is Dataflow different from Dataproc?
Dataflow is serverless: you submit Apache Beam code and Dataflow manages provisioning, scaling, and teardown. Dataproc provides managed Spark and Hadoop clusters where you choose machine types, cluster size, and lifecycle. Use Dataflow for new pipelines. Use Dataproc when you have existing Spark or Hadoop code to migrate.
Do I need to write Apache Beam code to use Dataflow?
Not always. Google provides pre-built Dataflow templates for common patterns like Cloud Storage to BigQuery or Pub/Sub to BigQuery. You run a template by providing parameters, with no Beam code required. For custom logic, you write Beam pipelines in Python, Java, or Go.
When should I avoid Dataflow?
Dataflow adds overhead that is not justified for every workload. If you are processing a few hundred events per minute with simple logic, a Pub/Sub push subscription to Cloud Run is simpler and cheaper. If you already have Spark jobs, Dataproc avoids a rewrite. For straightforward loads into BigQuery without transformation, a direct batch load or the Data Transfer Service may be enough.