Apache Beam Basics on GCP: Pipelines, PCollections, Transforms, and Dataflow

Apache Beam is an open-source SDK for building data pipelines that process both batch files and real-time streams. You write your pipeline logic once in Python, Java, or Go, and then run it on different execution engines without changing the code. On GCP, the standard way to run Beam pipelines in production is Dataflow, a fully managed service that handles worker provisioning, scaling, and fault tolerance for you.

This page covers the core concepts you need to understand before writing your first Beam pipeline: what a Pipeline, PCollection, and PTransform are, how runners work, and when Beam is the right tool for the job.

Simple explanation

Apache Beam is how you describe a data pipeline in code. You say: read data from here, transform it like this, and write the results over there. Beam turns that description into a graph of steps.

Dataflow is how Google Cloud runs that pipeline at scale. It takes the graph Beam built and distributes the work across as many machines as needed, handling failures and scaling automatically.

The two work together but are separate things. Beam is the programming model. Dataflow is one of several execution engines (called runners) that can execute Beam code.

Why Apache Beam exists

Before Beam, building a data pipeline meant choosing an execution engine first and writing code tightly coupled to it. If you wrote a Spark job, your code only ran on Spark. If you wanted to move to a different engine, you rewrote the pipeline from scratch. Testing required spinning up the same engine locally or deploying to a cluster and waiting.

Beam solves this by separating two concerns:

  • Pipeline logic: what data to read, how to transform it, and where to write the results.
  • Execution: how to distribute that work across machines, manage parallelism, and handle failures.

Analogy

Writing a Beam pipeline is like writing a recipe. You list every step: chop onions, heat the pan, cook for ten minutes, plate the dish. The recipe is complete before any cooking happens. The runner is the chef who executes the recipe. You can hand the same recipe to a home cook (DirectRunner) or a professional kitchen (Dataflow). The instructions do not change. Only the execution environment does.

You write the pipeline logic once. The runner handles execution. During development, you run the pipeline locally with the DirectRunner and get results in seconds. In production, you switch to DataflowRunner and the same code runs across hundreds of workers. If you later need to run on Apache Flink, you change the runner again without touching pipeline logic.

Beam also provides a single programming model that works for both batch and streaming data. Instead of learning one framework for files and another for message streams, you use the same PCollections and transforms for both. This unified model is one of the main reasons GCP teams adopt Beam for data pipeline design.

How Apache Beam works

A Beam pipeline follows a consistent flow from source to sink:

  1. Read data from a source. The pipeline reads input from a file in Cloud Storage, a BigQuery table, a Pub/Sub subscription, or any supported connector.
  2. Store it in a PCollection. The data becomes a PCollection: a distributed, parallel dataset that Beam manages across workers.
  3. Apply transforms. You chain PTransforms (Map, Filter, GroupByKey, Combine, and others) to parse, clean, enrich, aggregate, or reshape the data. Each transform takes a PCollection as input and produces a new PCollection as output.
  4. Send the pipeline to a runner. The runner receives the full pipeline graph and decides how to execute it. The DirectRunner runs everything locally. The DataflowRunner submits the graph to Dataflow for distributed execution.
  5. Write results to a sink. The final PCollection is written to an output destination: a file, a BigQuery table, a Pub/Sub topic, or a database.

This catches people off guard

None of this executes line by line. Beam builds the entire graph first, then hands it to the runner for execution. Your code is a description of what to do, not an imperative script that runs step by step. This means you cannot add a print() between steps and expect to see intermediate results the way you would in a normal Python script.

Core concepts

Pipeline

A Pipeline is the top-level object that represents your entire data processing job. It connects sources, transforms, and sinks into a directed acyclic graph (DAG). In Python, you create a pipeline as a context manager using with beam.Pipeline() as p. Everything inside the with block adds steps to the graph. The pipeline does not execute until you exit the context.

Think of a pipeline as a blueprint. You describe every step the data goes through before any data actually moves. The runner reads the blueprint and executes it.

PCollection

A PCollection (parallel collection) is the data flowing through a pipeline. Every transform takes one or more PCollections as input and produces a new PCollection as output. PCollections are the edges in the pipeline graph.

Example: reading a CSV file produces a PCollection of strings (one per line). Applying a Map transform to parse each line produces a new PCollection of parsed records.

Not a Python list

A PCollection is not a Python list. It represents data that may be distributed across many workers. You cannot index into it, loop over it in driver code, or hold it in memory on one machine. The only way to work with the data inside a PCollection is to apply a PTransform. Calling print() on a PCollection in your main script shows only a description of the graph node, not its elements, and trying to index or iterate over it raises an error.

PTransform

A PTransform is an operation applied to a PCollection. Beam provides built-in transforms for the most common operations:

  • Map: apply a function to each element individually.
  • Filter: keep only elements that match a condition.
  • FlatMap: like Map, but each input element can produce zero or more output elements.
  • GroupByKey: group elements by key, producing key-value pairs where each key maps to all its values.
  • Combine: aggregate values (sum, count, average) more efficiently than GroupByKey followed by manual aggregation.
  • Flatten: merge multiple PCollections of the same type into one.

You chain transforms together using the | operator. Each transform produces a new PCollection, so you can build long processing chains.

DoFn

A DoFn (Do Function) is a user-defined function that processes elements one at a time. The ParDo transform applies a DoFn to every element in a PCollection, in parallel across workers. You define a class that extends beam.DoFn and implement a process method that receives one element and yields zero or more output elements.

When to use Map vs DoFn

For simple one-liner transformations, beam.Map(lambda x: x.upper()) is cleaner and easier to read. Switch to a full DoFn class when you need setup/teardown logic (like opening a database connection once per worker), side inputs, error handling per element, or multiple output collections.

Bounded vs unbounded data

A bounded PCollection has a defined beginning and end. Examples: a CSV file, a database export, a set of Avro files in Cloud Storage. Beam processes bounded data as a batch job that starts, processes everything, and finishes.

An unbounded PCollection has no end. Examples: messages arriving on a Pub/Sub topic, events from a Kafka stream. Beam processes unbounded data as a streaming pipeline that runs continuously, using windowing and watermarks to group data into finite chunks for processing.

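The same transforms work on both bounded and unbounded PCollections. Whether your pipeline runs as batch or streaming depends on the data source, not on your transform code. The one addition streaming usually requires is a windowing step before any grouping or aggregation, because a GroupByKey or Combine over an unbounded PCollection in the default global window would never see the end of its input.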

Runners

A runner is the execution engine that takes your pipeline graph and runs it. Beam supports multiple runners:

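  • DirectRunner: runs on your local machine using threads. Use it for development and testing. It is designed to check that a pipeline follows the Beam model, so most logic bugs surface locally. Behavior that depends on scale, such as memory pressure, serialization across workers, or performance, can still differ on Dataflow.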
  • DataflowRunner: submits the pipeline to Google Cloud Dataflow for fully managed distributed execution.
  • FlinkRunner / SparkRunner: runs on Apache Flink or Spark clusters for teams that already operate those engines.

You switch runners by changing a pipeline option. The pipeline code itself does not change. This portability is one of Beam’s core design goals.

Apache Beam and Dataflow

This is the single most important distinction for beginners to understand:

The key distinction

Apache Beam = the open-source SDK and programming model. You use it to write pipeline code.
Dataflow = the managed GCP service. It runs Beam pipelines at scale without you managing any infrastructure.
You write Beam code. Dataflow runs it. They are not the same thing, even though Beam originated from the Dataflow SDK that Google donated to the Apache Software Foundation.

During development, you use the DirectRunner to test pipelines locally with small data samples. This gives you fast feedback without provisioning any cloud resources. When you are ready for production, you switch to the DataflowRunner. Dataflow provisions workers, distributes the work, autoscales based on load, and shuts everything down when the job finishes (for batch) or scales down during quiet periods (for streaming).

The pipeline code is identical in both cases. Only the runner option changes. For details on what Dataflow manages, see the Dataflow Overview.

A basic pipeline in Python

This pipeline reads order records from a CSV file in Cloud Storage, filters out cancelled orders, extracts the order total, and writes the results to a new file.

import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | 'Read orders' >> beam.io.ReadFromText(
            'gs://my-shop-data/orders/*.csv',
            skip_header_lines=1
        )
        | 'Parse CSV rows' >> beam.Map(lambda line: line.split(','))
        | 'Remove cancelled' >> beam.Filter(lambda fields: fields[3] != 'CANCELLED')
        | 'Extract total' >> beam.Map(lambda fields: f"{fields[0]},{fields[4]}")
        | 'Write results' >> beam.io.WriteToText(
            'gs://my-shop-data/output/active-orders'
        )
    )

What each part does

with beam.Pipeline() as pipeline: creates a Pipeline object as a context manager. Everything inside the block builds the pipeline graph. The pipeline runs when the with block exits. Without extra options, this uses the DirectRunner by default.

| (pipe operator): applies a transform to the PCollection produced by the previous step. Each | takes the output PCollection from the left side and feeds it into the transform on the right side.

>> (right-shift operator): assigns a human-readable name to the step. The string before >> becomes the step label in pipeline graphs and job logs. Every step name must be unique within the pipeline.

ReadFromText: reads files from the given path and produces a PCollection of strings, one per line. The skip_header_lines=1 skips the CSV header row.

beam.Map: applies a function to each element. Here, it splits each line on commas to produce a list of fields.

beam.Filter: keeps only elements where the function returns True. This drops any row where the status field (index 3) is CANCELLED.

Second beam.Map: extracts the order ID (index 0) and total (index 4) into a comma-separated string for the output file.

WriteToText: writes the final PCollection to files in Cloud Storage.

Switching to Dataflow

To run this same pipeline on Dataflow instead of your local machine, pass pipeline options specifying the runner, project, and region. The pipeline logic stays identical:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project, region, and bucket: replace with your own values.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-gcp-project',
    region='us-central1',
    temp_location='gs://my-shop-data/temp'
)

with beam.Pipeline(options=options) as pipeline:
    # Exact same pipeline code as above
    ...

Only the options differ between the local run and the Dataflow run. This is the portability that Beam provides.

When to use Apache Beam

Beam is a strong choice when your pipeline needs one or more of these characteristics:

  • Batch ETL at scale. Reading files from Cloud Storage, transforming them, and loading the results into BigQuery. This is the most common ETL pattern on GCP.
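  • Real-time streaming. Consuming messages from Pub/Sub, transforming each message, and writing to BigQuery or Cloud Storage continuously. Beam provides the windowing and late-data model, and Dataflow can process each record exactly once. End-to-end guarantees still depend on the runner and the sinks you use.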
  • Unified batch and streaming. When the same business logic needs to run on both historical data (batch) and live data (streaming), Beam lets you write that logic once.
  • Complex multi-step transformations. Joining multiple data sources, applying business rules, computing aggregations, and writing to multiple sinks in a single pipeline.
  • Portability. When you want the option to run the same pipeline on Dataflow today and on Flink or Spark later without rewriting.
  • Local testing before cloud deployment. The DirectRunner lets you validate pipeline logic in seconds before submitting to Dataflow.

When Apache Beam may be overkill

Check simpler options first

Beam adds real complexity. Before reaching for it, ask whether a simpler tool solves the problem.

Cases where Beam is probably not the right starting point:

  • Small one-off scripts. Transforming a single file under a few hundred MB? A Python script with pandas or a BigQuery SQL query is faster to write and run.
  • Low-throughput event handling. Processing a small number of events per second? A Pub/Sub subscription with a Cloud Run service or Cloud Function is simpler than a full streaming Beam pipeline.
  • Standard patterns covered by Dataflow templates. Google provides pre-built Dataflow templates for common jobs like Pub/Sub-to-BigQuery and Cloud Storage-to-BigQuery. If a template fits, you skip writing Beam code entirely. See the Dataflow Overview for template options.
  • Existing Spark jobs. If your team already has working Spark pipelines, migrating them to Dataproc is often simpler than rewriting in Beam. See Running Spark in GCP.
  • SQL-only transformations. If everything you need is SQL on data already in BigQuery, scheduled queries or BigQuery scripting avoids the need for a pipeline framework entirely.

Apache Beam vs Dataflow vs Spark/Dataproc

These three names come up together constantly. Here is how they relate:

| Dimension | Apache Beam + Dataflow | Spark + Dataproc |
| --- | --- | --- |
| What you write | Beam SDK code (Python, Java, Go) | PySpark / Spark Scala code |
| Where it runs | Dataflow (serverless, fully managed) | Dataproc (managed Spark clusters) |
| Cluster management | None. Dataflow provisions and scales workers automatically. | You choose cluster size, machine types, and manage lifecycle. |
| Streaming | Built-in unified batch/streaming model. | Spark Structured Streaming, but less native than Beam. |
| Portability | Beam code runs on Dataflow, Flink, or Spark runners. | Spark code runs on Dataproc, EMR, or local Spark. |
| Best for | New GCP pipelines, streaming, serverless ETL. | Existing Spark jobs, ML workloads with Spark MLlib, interactive PySpark notebooks. |

Quick decision guide

Choose Beam + Dataflow when you want zero cluster management, need both batch and streaming in one framework, or are building new pipelines on GCP from scratch. See Designing Data Pipelines for how Beam fits into a larger pipeline architecture.

Choose Spark + Dataproc when you have existing Spark jobs, need Spark-specific libraries (MLlib, GraphX), or want interactive PySpark notebook exploration. See the Dataproc Overview and Running Spark in GCP for details.

Common beginner mistakes

  1. Treating a PCollection like a Python list. A PCollection is distributed across workers. You cannot index into it, print it, or loop over it in your driver code. To inspect data during development, add a beam.Map(print) step inside the pipeline, or write to a file and check the output.
  2. Confusing Beam with Dataflow. Beam is the SDK. Dataflow is the runner. Saying “I wrote a Dataflow pipeline” is technically imprecise: you wrote a Beam pipeline that runs on Dataflow. This distinction matters when debugging, reading documentation, and choosing runners.
  3. Skipping local testing. Submitting an untested pipeline to Dataflow means waiting for workers to provision before you see the first error. Always run with the DirectRunner and a small data sample first. Most logic errors surface in seconds locally.
  4. Using duplicate step names. Every step in a Beam pipeline must have a unique name (the string before >>). Duplicate names cause a pipeline construction error. Use descriptive names like ‘Parse order CSV’ rather than generic ‘Parse’.
  5. Using global mutable state inside a DoFn. DoFn instances run on different workers. A module-level variable set on one worker is not visible to others. Pass shared configuration as constructor arguments to the DoFn class instead.
  6. Choosing Beam when a simpler tool works. Not every data task needs a distributed pipeline framework. A pandas script, a BigQuery scheduled query, or a Dataflow template might solve the problem with far less code.

Frequently asked questions

What is the difference between Apache Beam and Dataflow?

Apache Beam is the open-source SDK and programming model you use to write pipeline code in Python, Java, or Go. Dataflow is the managed GCP service that executes Beam pipelines at scale. You write Beam code; Dataflow runs it. Beam can also run on other runners like Apache Flink or Spark, but Dataflow is the standard runner on GCP.

Is Apache Beam for batch or streaming?

Both. Apache Beam uses a unified programming model for batch and streaming. Bounded PCollections (files, database exports) run as batch jobs. Unbounded PCollections (Pub/Sub messages, Kafka streams) run as streaming jobs. The same SDK, transforms, and code structure work in both modes.

What is a PCollection in simple terms?

A PCollection is a distributed dataset inside a Beam pipeline. Think of it as the data flowing between steps. Unlike a Python list, a PCollection is spread across multiple workers and you cannot index into it or loop over it directly. You transform it by applying PTransforms like Map, Filter, or GroupByKey.

Can I run Apache Beam locally before using Dataflow?

Yes. The DirectRunner runs your pipeline on your local machine using threads instead of distributed workers. Use it for development and testing with small data samples. When you are ready for production, switch to DataflowRunner by changing a single option. Your pipeline code stays the same.

When should I use Spark and Dataproc instead of Beam and Dataflow?

Use Dataproc with Spark when you already have Spark jobs you want to run on GCP, need Spark-specific libraries, or want interactive notebook exploration with PySpark. Use Beam with Dataflow when you want fully serverless execution with no cluster management, need unified batch and streaming support, or want pipeline portability across runners.

Last verified: 26 March 2026. Cloud services change frequently. Verify details against official documentation before making infrastructure decisions.