GCP Streaming Pipelines | Dataflow, Pub/Sub, and BigQuery

A streaming pipeline processes data continuously as it arrives, rather than waiting for all data to accumulate. In GCP, the standard pattern routes events through Pub/Sub, processes them in Dataflow using Apache Beam, and writes results to BigQuery in near-real time. This page explains how each piece fits together and when streaming is the right choice.

What this means in simple terms

Most data processing waits. You collect a day’s worth of logs, run a job overnight, and check the results in the morning. That is batch processing, and it works well for many use cases.

Streaming is for when you cannot wait. Instead of collecting data and processing it later, a streaming pipeline processes each event within seconds of it happening. Think of a fraud detection system that needs to flag a suspicious transaction before it clears, or an IoT dashboard that shows live sensor readings.

The challenge is that streaming data never ends. You cannot aggregate “all the data” because there is always more coming. So streaming pipelines use windows to group events into time-based chunks, watermarks to estimate when a window’s data is complete, and triggers to decide when to emit results. These three concepts are what make streaming pipelines fundamentally different from batch.

Analogy

Batch processing is like weighing all the fish at the end of the day. Streaming is like weighing each fish as it comes off the hook. Both give you weights, but streaming lets you react immediately if something unusual shows up.

What a streaming pipeline in GCP actually is

A GCP streaming pipeline follows a consistent pattern: producers generate events, Pub/Sub buffers and distributes them, Dataflow processes them continuously using Apache Beam, and results land in BigQuery or another sink.

The data is unbounded, meaning there is no defined end to the stream. Unlike a batch pipeline that reads a file from start to finish, a streaming pipeline reads from a source that produces events indefinitely. A Pub/Sub subscription is unbounded. A CSV file in Cloud Storage is bounded.

Real-world examples
  • Fraud detection scores each transaction in real time before it settles
  • IoT monitoring aggregates sensor readings every few minutes to detect anomalies
  • Live dashboards update operational metrics as events happen
  • Clickstream analytics tracks user behaviour across a site in near-real time
  • Alerting triggers notifications when error rates spike above a threshold

How streaming pipelines work in GCP

Here is how data flows through a typical GCP streaming pipeline, step by step:

  1. Events are produced. Applications, devices, or services generate events such as a purchase, a sensor reading, or a page view. Each event carries a payload and an event timestamp.
  2. Pub/Sub buffers and distributes them. Producers publish to a Pub/Sub topic. Pub/Sub provides at-least-once delivery and decouples producers from consumers. If Dataflow falls behind, Pub/Sub holds messages until they can be processed.
  3. Dataflow processes events continuously. A Dataflow job runs an Apache Beam pipeline that reads from the Pub/Sub subscription, applies transforms, and never terminates.
  4. Event time is read from the message. Each event’s timestamp is extracted from the message payload or Pub/Sub publish time. This is the event time that windowing is based on.
  5. Windows group events into time-based chunks. Because the stream never ends, windows define finite boundaries for aggregation. A 5-minute fixed window groups all events with event times between 14:00 and 14:05.
  6. Watermarks estimate completeness. Dataflow tracks the watermark, its best estimate of how far through event time it has progressed. When the watermark passes the end of a window, Dataflow considers that window’s data mostly complete.
  7. Triggers decide when results are emitted. The default trigger fires once when the watermark passes the window boundary. Custom triggers can fire early (for speculative results), on every late element, or on a combination of conditions.
  8. Results are written to a sink. The processed output goes to BigQuery, Cloud Storage, another Pub/Sub topic, or any supported destination.
Producer → Pub/Sub topic → Dataflow (Beam) → BigQuery

                        Dead-letter topic
                        (inspect & replay)
Event time vs processing time

Event time is when the event occurred. Processing time is when the pipeline received it. Always use event time for windowing. If you use processing time, a backlog of delayed messages produces aggregations that reflect when data arrived, not when things actually happened.

When to use streaming pipelines

Streaming is the right choice when the business genuinely needs results in seconds or minutes:

  • Real-time fraud scoring to block a transaction before it clears, not in tomorrow’s report
  • Operational alerting to trigger an alert when error rates spike, not when a batch job runs hours later
  • Live dashboards to show current state, not yesterday’s state
  • User-facing personalisation to adjust recommendations based on what a user is doing right now
  • IoT anomaly detection to identify a failing sensor before it causes damage

The common thread is latency-driven decision making. If acting on stale data has a meaningful cost (financial, operational, or user experience), streaming is worth the added complexity.

When batch is a better choice

Batch pipelines are simpler to build, cheaper to run, and easier to debug. If your use case does not require sub-minute latency, batch is almost always the better starting point.

  • Daily reports: analysts check dashboards once a day. A nightly batch job is sufficient.
  • Historical analysis: backfilling months of data is a batch problem, not a streaming one.
  • Cost-sensitive workloads: batch loads into BigQuery from Cloud Storage are free. Streaming writes are not.
  • Simple transformations: if you are just moving data from A to B on a schedule, a batch pipeline avoids windowing and watermark complexity entirely.
Rule of thumb

Start with batch and move to streaming only when the latency requirement justifies it. Many teams over-invest in streaming pipelines for workloads where a scheduled batch job would have been cheaper and simpler. See Designing Data Pipelines for more on making this decision.

Streaming pipelines vs batch pipelines

StreamingBatch
DataUnbounded, continuousBounded, finite
LatencySeconds to minutesMinutes to hours
ProcessingContinuous, always-onScheduled, runs to completion
ComplexityHigher: requires windowing, watermarks, triggers, late data handlingLower: reads input, transforms, writes output
CostHigher: workers run continuously, streaming writes to BigQuery cost per byteLower: workers run only during the job, batch loads to BigQuery are free
CorrectnessMust handle late data, duplicates, and out-of-order eventsSimpler: all data is present before processing starts
DebuggingHarder: state is distributed and continuousEasier: rerun the job on the same input
Use casesFraud detection, live dashboards, alerts, IoTReports, backfills, data warehouse loads, analytics

For a broader comparison of pipeline design approaches, see ETL vs ELT.

Windowing explained

Windows divide an unbounded stream into finite chunks so you can aggregate data. Beam provides three main window types.

Fixed (tumbling) windows split the stream into non-overlapping intervals of equal size. All events with event times between 14:00 and 14:05 go in one window; 14:05 to 14:10 in the next. Use these for clean, non-overlapping time buckets like hourly counts or 5-minute averages.

Sliding windows overlap. A 1-hour window that slides every 10 minutes produces a new window every 10 minutes, each covering the most recent 60 minutes. Use these for rolling averages and moving metrics.

Session windows group events by gaps in activity. A window opens when an event arrives and closes after a configurable period of inactivity (the gap duration). Use these for user session analytics where bursts of activity matter more than fixed time boundaries.

Analogy

Fixed windows are like counting cars on a motorway every 15 minutes. Each interval is the same length, no overlaps. Sliding windows are like checking the last hour of traffic every 10 minutes: overlapping views of the same road. Session windows are like measuring how long each driver stays in a rest stop. Each “window” starts and ends based on individual behaviour.

Watermarks, triggers, and late data explained

These three concepts work together to answer one question: when should the pipeline emit results for a window?

Watermarks

The watermark is Dataflow’s estimate of the current event-time frontier. It tracks how far through event time the pipeline has progressed. When the watermark passes the end boundary of a window, Dataflow considers that window’s input mostly complete.

The watermark is not a guarantee. It is a heuristic. Data can still arrive after the watermark has passed, and that data is considered late.

Analogy

Think of the watermark like a “latest expected arrival” estimate at an airport. When the gate agent believes all passengers have boarded (the watermark passes), they close the door. If someone runs up late, they might get on if the door is still open (allowed lateness), or they miss the flight entirely (late data discarded).

Triggers

Triggers control when a window’s results are emitted. The default trigger fires once when the watermark passes the window boundary. But you can configure more sophisticated behaviour:

  • Early triggers emit speculative results before the watermark passes the window. Useful for dashboards that need fast initial results even if they are incomplete.
  • Late triggers emit updated results when late data arrives after the watermark has passed.
  • Composite triggers combine conditions. For example: emit early every 30 seconds, then emit again on each late element.

Allowed lateness

Allowed lateness defines how long after the watermark passes a window boundary Dataflow should continue accepting late data for that window. Without allowed lateness, late data is silently discarded.

For example, setting allowed lateness to 2 hours means that if a window closes at 14:05 and the watermark passes, Dataflow will still accept events for that window until 16:05. When a late event arrives within the allowed lateness period, the late trigger fires and an updated result is emitted.

Watch your memory costs

Allowed lateness has a direct cost. Dataflow must keep window state in memory for the duration of the allowed lateness period. Very large allowed lateness values increase memory usage and worker costs. Set it based on how late your data realistically arrives, not as an arbitrarily large safety net.

Example: Pub/Sub to Dataflow to BigQuery

This example reads IoT sensor events from Pub/Sub, assigns event-time timestamps, applies 5-minute fixed windows with 2 hours of allowed lateness, and writes average temperatures per sensor to BigQuery using the Storage Write API.

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows
from apache_beam.transforms.trigger import AfterWatermark, AfterProcessingTime, AccumulationMode
import apache_beam.transforms.window as window

options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-app-prod',
    '--region=europe-west2',
    '--streaming',
    '--temp_location=gs://my-app-prod-data/temp',
])

def assign_event_timestamp(element):
    """Use the timestamp from the event payload as the element's event time."""
    import apache_beam as beam
    record = json.loads(element)
    event_time = record['event_timestamp']
    return beam.window.TimestampedValue(record, event_time)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | 'Read from Pub/Sub' >> beam.io.ReadFromPubSub(
            topic='projects/my-app-prod/topics/sensor-readings'
          )
        | 'Assign event time' >> beam.Map(assign_event_timestamp)
        | 'Window 5 min' >> beam.WindowInto(
            FixedWindows(5 * 60),
            trigger=AfterWatermark(
                early=AfterProcessingTime(60),
            ),
            accumulation_mode=AccumulationMode.ACCUMULATING,
            allowed_lateness=2 * 60 * 60
          )
        | 'Key by sensor' >> beam.Map(lambda r: (r['sensor_id'], r['temperature']))
        | 'Average per sensor' >> beam.CombinePerKey(beam.combiners.MeanCombineFn())
        | 'Write to BigQuery' >> beam.io.WriteToBigQuery(
            'my-app-prod:analytics.sensor_averages',
            method=beam.io.WriteToBigQuery.Method.STORAGE_WRITE_API,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND
          )
    )
What each stage does
  • ReadFromPubSub reads raw messages from the Pub/Sub topic as bytes.
  • Assign event time parses the JSON payload and uses the event_timestamp field as the element’s event time. Without explicit timestamp assignment, Beam defaults to Pub/Sub publish time.
  • WindowInto groups events into 5-minute fixed windows. The trigger emits early speculative results every 60 seconds (useful for dashboards), then fires again when the watermark passes. Allowed lateness of 2 hours means late-arriving events are still accepted and trigger updated results. ACCUMULATING mode means each output includes all data seen so far for the window, not just the new elements.
  • Key by sensor + CombinePerKey computes the mean temperature per sensor within each window.
  • WriteToBigQuery writes results using the Storage Write API, which is cheaper and higher throughput than the legacy streaming insert API.

Common beginner mistakes

  1. Using processing time instead of event time for windowing. If you window by when Dataflow receives messages rather than when events occurred, a backlog of delayed messages produces misleading aggregations. Always extract the event timestamp from the message payload and assign it explicitly.
  2. Not configuring allowed lateness or triggers. Without allowed lateness, Dataflow discards late-arriving data after the watermark passes the window boundary. Without late triggers, even if allowed lateness is set, the pipeline may not emit updated results when late data arrives. Configure both to match your data’s real-world lateness profile.
  3. Choosing streaming when batch is enough. Streaming adds cost, complexity, and operational overhead. If your stakeholders check reports daily, a nightly batch pipeline is simpler and cheaper. Do not use streaming just because you can.
  4. Weak event schema design. Events without a clear event timestamp, unique ID, or consistent schema create problems downstream. Missing timestamps force fallback to processing time. Missing unique IDs make deduplication impossible. Invest in a clean event schema before building the pipeline.
  5. Ignoring cost implications. Dataflow workers running continuously cost significantly more than batch jobs that run for a few minutes per day. Streaming writes to BigQuery are charged per byte; batch loads from Cloud Storage are free. Estimate costs before committing to a streaming architecture. See BigQuery Pricing Explained for write cost details.
  6. Forgetting the —streaming flag. Beam pipelines default to batch mode. If your source is an unbounded Pub/Sub topic, pass —streaming in PipelineOptions. Without it, the pipeline fails or behaves incorrectly.

Cost and operational considerations

Streaming pipelines are more expensive to run than batch equivalents. Understand the trade-offs before committing.

  • Dataflow compute: streaming jobs run 24/7. Workers are billed per second. A job with 4 workers running continuously costs far more than a batch job that runs for 10 minutes per hour. Use —max_num_workers to cap autoscaling costs.
  • BigQuery writes: the Storage Write API charges per byte written. Batch loads from Cloud Storage are free. For high-volume streams, write costs can be significant. See BigQuery Pricing Explained for current rates.
  • Pub/Sub: charged per message and per byte delivered. At high throughput, Pub/Sub costs are non-trivial. Batching messages on the producer side reduces per-message overhead.
  • State and allowed lateness: Dataflow keeps window state in memory for the duration of the allowed lateness period. Longer allowed lateness means more state, more memory, and higher worker costs.
  • Operational overhead: streaming jobs need monitoring, alerting, and on-call attention. A stuck pipeline means stale data. Plan for alerting on pipeline lag and throughput from day one.
Cost trap

A streaming pipeline that nobody actually needs in real time is one of the most common sources of wasted cloud spend. Before building, ask: “What decision changes if this data is 1 hour old instead of 10 seconds old?” If the answer is “nothing,” use batch.

For broader cost strategies, see Cost Optimisation Strategies.

Related topics and next steps

Now that you understand streaming pipelines, these pages go deeper into the individual components:

Frequently asked questions

What is the difference between event time and processing time?

Event time is the timestamp embedded in the event payload that records when the event actually happened. Processing time is the moment the pipeline receives and processes it. These two times differ because of network delays, retries, and buffering. Correct streaming pipelines window by event time so that results reflect when things happened, not when they arrived.

What is a watermark in Dataflow?

A watermark is Dataflow's running estimate of how far behind in event time the pipeline currently is. When the watermark passes a window boundary, Dataflow considers that window complete and emits its result. Any data that arrives after the watermark has passed the window is treated as late data.

How is a streaming pipeline different from a batch pipeline?

A batch pipeline processes a bounded dataset that has a defined start and end. A streaming pipeline processes unbounded data continuously. Streaming requires windowing to group events into finite chunks, watermarks to estimate completeness, and late data handling to deal with out-of-order arrivals. None of these concepts exist in batch.

Should I use streaming inserts or the Storage Write API for BigQuery?

The BigQuery Storage Write API is recommended for all new streaming pipelines. It provides higher throughput, lower cost per byte, and supports exactly-once semantics when configured correctly. The older streaming insert API is more expensive and has lower throughput limits.

How much does a streaming pipeline cost compared to batch?

Streaming pipelines are almost always more expensive. Dataflow workers run continuously rather than on a schedule. BigQuery streaming writes cost more than free batch loads from Cloud Storage. The trade-off is latency: if you need results in seconds, streaming justifies the cost. If hourly or daily results are acceptable, batch is significantly cheaper.

Last verified: 26 March 2026 Cloud services change frequently. Verify details against official documentation before making infrastructure decisions.