How to Design Data Pipelines in GCP: Batch vs Streaming

Designing a data pipeline in GCP means choosing how data moves from where it is produced to where it is analysed, and making sure nothing breaks silently along the way. The decisions that matter most are batch vs streaming, where data lands, which service runs transformations, how failures are handled, and how you monitor the whole thing. This guide walks through each decision so you can pick the right architecture without overengineering it.

Designing data pipelines in simple terms

A data pipeline moves data from a source to a destination and transforms it along the way. Every pipeline, no matter how complex, follows the same five-part model:

  1. Producer: the system that generates data. An application writing user events, a database exporting change logs, or a file landing in Cloud Storage.
  2. Transport: the service that carries data between stages. Pub/Sub for streaming events, Cloud Storage for batch files.
  3. Transform: the step that cleans, reshapes, or enriches data. Dataflow, BigQuery SQL, or a Spark job on Dataproc.
  4. Destination: where transformed data lands for consumption. BigQuery for analytics, Cloud Storage for a data lake, or an operational database for serving.
  5. Monitoring: the observability layer that tells you when something breaks. Alerts on job lag, message backlog, and task failures.
Analogy

Think of a data pipeline like a factory assembly line. Raw materials (data) arrive at one end. Each station (service) does a specific job: cleaning, sorting, reshaping. At the other end, a finished product (analytics table) rolls off the line. If any station jams, a sensor (monitoring) catches it before defective products pile up.

A collection of scripts that moves data between services is not a pipeline. A pipeline handles failures explicitly, can be rerun safely, and is monitored independently of the applications that produce the data.

How a GCP data pipeline works

Data enters from one or more sources, passes through a transport layer, gets transformed by a processing service, and lands in a destination where analysts or applications consume it. Monitoring runs alongside every stage.

Source data (app events, files, databases)

Transport (Pub/Sub or Cloud Storage)

Transform (Dataflow, BigQuery SQL, or Dataproc)

Destination (BigQuery, Cloud Storage, or serving store)

Monitoring (Cloud Monitoring alerts, dashboards, logs)

The specific services you pick at each layer depend on latency requirements, data volume, team expertise, and budget. The sections below walk through each choice.

The first decisions to make

Batch vs streaming

Batch pipelines process a bounded set of data on a schedule. A nightly job reads yesterday’s transaction files from Cloud Storage, transforms them, and loads them into BigQuery. Latency is measured in hours. Complexity is low.

Streaming pipelines process data continuously as it arrives. Events flow from Pub/Sub into Dataflow, which aggregates and writes results to BigQuery in near-real time. Latency is measured in seconds. Complexity is higher because you need windowing, watermarks, and late data handling.

Streaming is not automatically better. It adds continuous compute cost, operational complexity, and harder debugging. If stakeholders check reports once a day, a nightly batch job is simpler, cheaper, and easier to reason about. Start with batch. Move to streaming when the latency requirement genuinely justifies the trade-offs.

Warning

Choosing streaming “just in case” is the single most common overengineering mistake in GCP data pipelines. Streaming jobs run 24/7 and accumulate cost even at idle throughput. If nobody is watching a real-time dashboard, you are paying for latency nobody uses.

Where data should land

BigQuery is the default destination for analytics workloads. It handles petabyte-scale SQL queries without infrastructure management. If your pipeline exists to power dashboards, reports, or ad-hoc analysis, data should land in BigQuery.

Cloud Storage is the right destination when you need a data lake: raw files in open formats (Parquet, Avro) that multiple tools can read. Cloud Storage is also the staging layer for batch loads into BigQuery.

An operational database or serving store (Firestore, Cloud SQL, Bigtable) is the destination when downstream applications need low-latency lookups rather than analytical queries. Most teams need both: an operational store for the application and BigQuery for analytics.

How transformations should run

Dataflow is the default for new pipelines that need streaming, complex joins, or high-throughput parallel processing. It is serverless: you submit Apache Beam code and Dataflow handles provisioning, scaling, and teardown.

BigQuery SQL and scheduled queries are enough for many batch workloads. Load raw data first, then transform it with SQL. This is the ELT approach, and it avoids a separate compute layer entirely. Analysts can iterate on transformation logic without engineering involvement.

Dataproc is the right choice when your team already has Spark jobs they want to run on GCP without rewriting them in Beam. It gives you managed Spark clusters with more control and more operational responsibility than Dataflow.

How orchestration should work

Cloud Composer (managed Apache Airflow) is for multi-step workflows with dependencies between steps. A Composer DAG can run a Dataflow job, wait for it to finish, load results into BigQuery, then trigger a dbt transformation, all with retry policies and scheduling built in.

Simpler options are often enough. A single Dataflow job triggered by Cloud Scheduler does not need Composer. A BigQuery scheduled query runs on its own schedule. An event-driven pipeline triggered by a file landing in Cloud Storage needs only an Eventarc trigger and a Cloud Run service.

Tip

Ask yourself: “Does this pipeline have more than two steps that depend on each other?” If not, Cloud Composer is probably more orchestration than you need. A Cloud Scheduler cron trigger or an Eventarc event is simpler, cheaper, and has nothing to manage.

Reference architectures

Simple batch pipeline

Files land in Cloud Storage on a schedule (exported from an application, dropped by a partner, or generated by a database export). A Dataflow batch job or BigQuery scheduled query transforms the data and loads it into BigQuery. Cloud Scheduler triggers the job daily.

Cloud Storage (raw files)

Dataflow batch job or BigQuery scheduled query

BigQuery (analytics tables)

This is the simplest production-ready architecture. It covers the majority of reporting and analytics use cases. For help choosing between Dataflow and BigQuery SQL for the transform step, see the ETL vs ELT comparison.

Streaming analytics pipeline

Applications publish events to a Pub/Sub topic. A Dataflow streaming job consumes the subscription, applies transformations and windowing, and writes to BigQuery using the Storage Write API. A dead-letter topic captures messages that fail processing repeatedly.

Application → Pub/Sub topic → Dataflow streaming job → BigQuery

                              Dead-letter topic
                              (inspect and replay)

Dataflow handles backpressure automatically. If events arrive faster than the job processes them, Pub/Sub buffers them. The job autoscales workers to match throughput. Set —max_num_workers to cap cost.

Tip

Always add a dead-letter Pub/Sub topic to streaming pipelines. Without one, a consistently failing message (malformed JSON, a schema mismatch) causes the pipeline to retry indefinitely or silently drop data. Route failures to a dead-letter topic where they can be inspected and replayed once the root cause is fixed.

Hybrid batch and streaming

Some workloads need both. A streaming pipeline writes events to BigQuery in near-real time for operational dashboards. A separate nightly batch job runs heavier transformations on the same underlying data: joining with reference data, computing daily aggregates, rebuilding materialised views. The streaming path gives low latency; the batch path gives completeness and lower cost for heavy computation.

                   ┌─→ Pub/Sub → Dataflow → BigQuery (real-time table)
Source events ─────┤
                   └─→ Cloud Storage → nightly batch job → BigQuery (daily aggregates)
Note

The hybrid pattern avoids forcing all processing into streaming. Keep streaming for what genuinely needs low latency, and let batch handle everything else. This is how most mature data teams end up structuring their pipelines.

When to use this approach

  • Daily reporting and dashboards. A batch pipeline loads yesterday’s data into BigQuery overnight. Analysts query fresh tables each morning.
  • Real-time fraud detection. A streaming pipeline processes payment events within seconds, flags anomalies, and writes results to an operational store and BigQuery.
  • IoT sensor aggregation. A streaming pipeline windows sensor readings into 5-minute averages and writes to BigQuery for monitoring dashboards.
  • Log analytics. A batch or micro-batch pipeline moves application logs from Cloud Storage into BigQuery for troubleshooting and trend analysis.
  • Data warehouse refresh. A nightly ELT pipeline loads raw data into BigQuery, then dbt or scheduled queries build curated tables for the business.
  • Event-driven analytics. An event-driven architecture fans out events to multiple consumers, with one subscriber writing to BigQuery for analytics.

When not to use Dataflow

Dataflow is the right tool for high-throughput streaming, complex windowed aggregations, and parallel batch transformations. But it is not always necessary.

  • Simple batch SQL transformations. If your pipeline is “load files into BigQuery, then run SQL to clean and reshape,” BigQuery scheduled queries or dbt handle the transform step without a separate compute service.
  • Low-throughput event processing. For a few hundred events per minute, a Pub/Sub push subscription to Cloud Run is simpler and cheaper. Dataflow is worth its overhead only for high-throughput workloads with complex processing.
  • Existing Spark jobs. If your team already has PySpark code, Dataproc runs it on GCP without a rewrite. Migrating to Beam has a learning curve that may not be justified.
  • Single-step file loads. Loading CSV or Parquet files from Cloud Storage into BigQuery does not need Dataflow. Use bq load or the BigQuery Data Transfer Service directly.
Note

Not using Dataflow is not a compromise. Plenty of production pipelines run entirely on BigQuery scheduled queries and bq load. The best pipeline is the simplest one that meets your latency and correctness requirements.

Batch vs streaming: quick comparison

BatchStreaming
LatencyHours (scheduled runs)Seconds to minutes
ComplexityLow: bounded data, clear start and endHigher: windowing, watermarks, late data
Compute costPay only when the job runsContinuous cost while the job is running
DebuggingEasier: rerun with the same inputHarder: state, ordering, and timing matter
Common use casesDaily reports, data warehouse loads, backfillsFraud detection, live dashboards, IoT alerts
Operational burdenLower: jobs finish and stopHigher: jobs run 24/7, need monitoring

Dataflow vs Dataproc vs BigQuery scheduled queries

DataflowDataprocBigQuery scheduled queries
ModelServerless (Apache Beam)Managed clusters (Spark/Hadoop)Serverless SQL
Best forNew streaming and batch pipelinesExisting Spark/Hadoop workloadsSQL-only batch transforms
StreamingYes, nativeYes, via Spark Structured StreamingNo
Cluster managementNoneYou size and manage clustersNone
Learning curveBeam SDK (moderate)Spark (moderate to high)SQL (low)
When to avoidLow-throughput events, simple SQL transformsGreenfield pipelines with no Spark historyStreaming, complex joins across external sources

If you are starting from scratch with no existing Spark code, Dataflow is the default for streaming and BigQuery SQL is the default for batch transforms. See the Dataflow overview and Spark on GCP guide for deeper comparisons.

Design for reliability

Failure handling

Dead-letter handling. Every pipeline that consumes from Pub/Sub should have a dead-letter topic. Messages that fail processing after a configured number of retries are routed there automatically. Without one, a single malformed message causes indefinite retries or silent data loss.

Retries. Transient failures (network blips, temporary quota limits) are normal. Dataflow retries failed work items automatically. Cloud Composer retries failed tasks according to the DAG’s retry policy. Configure retry counts and backoff intervals explicitly rather than relying on defaults.

Idempotency. Pipelines fail and need to be rerun. If rerunning creates duplicate rows, you have a data quality problem. For batch loads, use WRITE_TRUNCATE to overwrite the target partition on each run. For streaming writes, use the BigQuery Storage Write API with exactly-once semantics, or write MERGE statements that upsert on a unique key.

Warning

Using WRITE_APPEND with no deduplication is the fastest way to corrupt a BigQuery table. After a pipeline failure and retry, every appended row appears twice. The duplicates may not surface until an analyst notices aggregations are wrong, days later. Default to WRITE_TRUNCATE for batch and MERGE or the Storage Write API for streaming.

Data quality

Schema validation and schema evolution. Pipelines break when upstream producers change event structure without notice. Validate incoming messages against a JSON Schema or Protocol Buffer definition at the ingestion step. Reject non-conforming records to a dead-letter topic. Plan for schema evolution: add new fields as nullable, version your schemas, and test schema changes in staging before production.

Replay and backfill. You will need to reprocess historical data at some point: after a bug fix, a schema change, or a new business requirement. Design for this from the start. Keep raw data in Cloud Storage so you can replay it through the pipeline. For streaming, Pub/Sub retains messages for up to 31 days, but for longer replay windows, archive raw events to Cloud Storage.

Tip

Store raw data before any transformation. If you only keep transformed data and a transformation bug ships, reprocessing means re-extracting from the source system, which may not be possible weeks later. A raw archive in Cloud Storage is cheap insurance.

Operations and cost

Observability. Monitor pipeline health separately from application health. Track Dataflow job lag, Pub/Sub subscription backlog (unacked message count), Cloud Composer task failure rates, and BigQuery slot utilisation. Set up alerts on these metrics so stale data is caught before an analyst notices a broken dashboard.

Regional placement. Run Dataflow workers in the same region as your Cloud Storage buckets and BigQuery datasets. Cross-region data transfer adds latency and incurs egress charges. Always specify —region when submitting a Dataflow job.

IAM and service accounts. Each pipeline component should run under a dedicated service account with the minimum permissions it needs. The Dataflow worker service account needs read access to the source and write access to the destination, nothing more. Avoid using the default compute service account for production pipelines.

Cost ceilings and guardrails. Set —max_num_workers on Dataflow streaming jobs to cap autoscaling cost. Use BigQuery cost controls (custom quotas, maximum bytes billed) to prevent runaway queries. Set budget alerts in billing so unexpected spikes are caught early.

Common mistakes

  1. Building a streaming pipeline when batch is sufficient. Streaming adds ongoing compute cost and operational complexity. If stakeholders check reports once a day, a nightly batch job is simpler, cheaper, and easier to debug. Start with batch; move to streaming only when the latency requirement is clear and measured.
  2. No dead-letter handling. Without a dead-letter topic, a single malformed message causes indefinite retries or silent data loss. Always route failed messages to a dead-letter topic where they can be inspected and replayed.
  3. Not designing writes to be idempotent. Pipelines fail and are rerun. WRITE_APPEND on failure creates duplicate rows that may not surface until an analyst notices incorrect aggregations. Use WRITE_TRUNCATE for batch partition loads and the Storage Write API or MERGE for streaming.
  4. Skipping schema contracts. When upstream producers change event structure without notice, pipelines break in ways that are hard to diagnose. Validate schemas at ingestion. Reject or quarantine non-conforming messages.
  5. Not planning for reprocessing. Every pipeline will need to reprocess historical data eventually. Keep raw data in Cloud Storage so you can replay it. If you only keep transformed data, reprocessing means re-extracting from the source.
  6. Not isolating staging from production. Testing pipeline changes directly against production data risks corrupting production tables. Maintain a separate staging project or dataset with its own Pub/Sub topics, Dataflow jobs, and BigQuery tables.
  7. Skipping pipeline monitoring. Application monitoring is not enough. Monitor Dataflow job lag, Pub/Sub subscription backlog, and Composer task failure rates separately. A pipeline that stops processing silently is worse than one that fails loudly.

Frequently asked questions

Should I use batch or streaming for my GCP data pipeline?

Start with batch. If stakeholders check dashboards once a day, a nightly job that lands data in BigQuery is simpler, cheaper, and easier to debug. Move to streaming only when latency requirements are measured in seconds, such as fraud detection, live dashboards, or real-time personalisation. Streaming adds windowing, watermark management, and continuous compute cost.

What is the difference between Dataflow and Dataproc?

Dataflow is serverless: you submit Apache Beam code and it handles provisioning, scaling, and teardown. Dataproc gives you managed Spark or Hadoop clusters where you choose machine types and manage the lifecycle. Use Dataflow for new pipelines written from scratch. Use Dataproc when you have existing Spark or Hadoop jobs you want to run on GCP without rewriting them.

When is Cloud Composer worth the cost?

Cloud Composer (managed Apache Airflow) is worth it when your pipeline has multiple dependent steps that must run in a specific order with retry logic and scheduling. For example: run a Dataflow job, wait for it to finish, load results into BigQuery, then trigger a dbt run. For a single scheduled job, a Cloud Scheduler trigger is simpler and cheaper.

Can I use BigQuery alone without Dataflow?

Yes, for many batch workloads. Load raw data into BigQuery via batch loads from Cloud Storage, then transform it with scheduled SQL queries or dbt. This ELT approach avoids a separate compute layer entirely. Dataflow is needed when you require real-time streaming, complex event processing, or pre-load transformations like data masking.

How do I prevent duplicate rows when a pipeline fails and reruns?

Make all writes idempotent. For batch pipelines, use WRITE_TRUNCATE to overwrite the target partition on each run so a re-run produces the same final state. For streaming pipelines, use the BigQuery Storage Write API with exactly-once semantics, or write MERGE statements that upsert on a unique key instead of appending.

Last verified: 26 March 2026 Cloud services change frequently. Verify details against official documentation before making infrastructure decisions.