ETL vs ELT in GCP: BigQuery, Dataflow, and dbt

ETL transforms data before loading it into the warehouse. ELT loads raw data first and transforms it inside the warehouse using SQL. In GCP, ELT is usually the right default: land raw data in BigQuery, then transform it with Dataform or dbt. ETL with Dataflow is still the right choice when transformation must happen before storage, such as streaming ingestion, PII masking, or formats BigQuery cannot read. This page helps you decide which pattern fits your pipeline, which GCP tools to use, and when to combine both.

Simple explanation

ETL and ELT describe the order of three steps: extracting data from a source, transforming it into a useful shape, and loading it into a destination. In ETL, you clean and reshape the data in a processing engine (like Dataflow) before it reaches the warehouse. In ELT, you load the raw data straight into the warehouse (BigQuery) and clean it there using SQL.

Analogy

ETL is a restaurant kitchen that preps and plates every dish before it leaves the kitchen. The dining room only sees finished meals. ELT is a hot-pot restaurant where raw ingredients are delivered to the table and diners cook them however they want. ELT gives you more flexibility because the raw ingredients are always available to reuse, but it requires a plan for how things will be cooked at the table.

ETL vs ELT: quick comparison

ETLELT
Order of stepsExtract → Transform → LoadExtract → Load → Transform
Where transformation happensProcessing engine (Dataflow / Apache Beam)Inside the warehouse (BigQuery SQL)
Raw data lands in BigQuery?No, only transformed data arrivesYes, raw data is loaded first
Primary GCP toolsDataflow, Apache Beam, Cloud ComposerBigQuery, Dataform, dbt, scheduled queries
Who usually owns the logicData engineers writing Python/Java Beam pipelinesData engineers or analysts writing SQL
Iteration speedSlower: pipeline code must be rebuilt and redeployedFaster: edit a SQL file, run the model
Cost shapeCompute cost for Dataflow workers + BigQuery storageBigQuery query cost (per TB scanned or slots) + storage
Best forStreaming, PII masking, binary parsing, high-throughput joinsBatch analytics, dimensional modeling, ad-hoc exploration
Poor fitBatch SQL transformations where BigQuery alone would sufficeStreaming data, pre-storage compliance requirements

How ETL works in GCP

In an ETL pipeline, data is extracted from a source system, transformed in a processing engine, and loaded into the destination in its final shape. The destination only ever receives clean, modelled data.

The primary GCP tool for ETL is Dataflow, which runs Apache Beam pipelines. You write pipeline code in Python or Java that reads from a source, applies transformations (parsing, filtering, joining, aggregating), and writes to a sink.

Batch ETL example

A nightly job reads CSV exports from Cloud Storage, validates and normalises the records, and writes clean rows to BigQuery:

Cloud Storage (CSV files) → Dataflow batch job → BigQuery (clean table)

Streaming ETL example

An application publishes events to Pub/Sub. A streaming Dataflow job consumes those events, parses the payload, masks PII fields, applies windowed aggregations, and writes results to BigQuery:

Pub/Sub → Dataflow (parse, mask PII, aggregate) → BigQuery

       Dead-letter topic

Why ETL still matters

ETL adds operational complexity. You maintain Beam pipeline code, manage Dataflow worker scaling, and redeploy when logic changes. That overhead is justified when:

  • Data arrives as a continuous stream and needs real-time windowing or aggregation before storage
  • Sensitive fields must be masked or redacted before any persistence (compliance requirement)
  • The raw format (binary, encrypted, proprietary protocol) cannot be loaded directly into BigQuery
  • You need to join streaming events with reference data at high throughput before writing
Note

Dataflow is not the only GCP option for ETL. Dataproc can run Spark ETL jobs, and Cloud Run can handle low-throughput event processing. But for most new ETL pipelines on GCP, Dataflow with Apache Beam is the default choice.

How ELT works in GCP

In ELT, raw data is extracted and landed in BigQuery (or Cloud Storage as an intermediate layer) without significant transformation. Transformation happens afterward inside BigQuery using SQL.

Common loading paths

  • Batch loads from Cloud Storage: upload CSV, JSON, Avro, or Parquet files to a Cloud Storage bucket, then load them into BigQuery with bq load or the Storage Write API
  • BigQuery Data Transfer Service: scheduled imports from SaaS sources (Google Ads, YouTube, S3) directly into BigQuery tables
  • Direct streaming inserts: applications write rows directly to BigQuery using the Storage Write API
  • Federated queries: query data in Cloud Storage directly from BigQuery without loading it

Transformation tools

Dataform is Google Cloud’s native SQL transformation tool. It is integrated into the BigQuery console, requires no separate infrastructure, and manages dependencies between SQL models. Dataform is a strong default for teams fully committed to GCP.

dbt (data build tool) is the cross-platform standard. It compiles SQL models, runs them in BigQuery, materialises results as tables or views, handles dependency ordering, runs data quality tests, and generates documentation. dbt is the better choice when you need portability across cloud providers or want access to the larger package ecosystem.

Scheduled queries in BigQuery can handle simple, standalone transformations without any external tool.

Analogy

Dataform and dbt are like recipe books for your warehouse kitchen. They tell BigQuery which ingredients (raw tables) to grab, what order to cook things in, and how to plate the final dish (mart table). Without a recipe book, you are running ad-hoc SQL by hand and hoping you remember the steps tomorrow.

ELT example flow

SaaS export / API → Cloud Storage (raw JSON) → BigQuery raw table

                                              Dataform or dbt

                                          BigQuery marts (clean tables)
-- Example Dataform model: models/marts/daily_revenue.sqlx
config {
  type: "table",
  schema: "marts"
}

SELECT
  DATE(order_timestamp) AS order_date,
  SUM(order_total)      AS daily_revenue,
  COUNT(DISTINCT order_id) AS order_count
FROM ${ref("raw_orders")}
WHERE status = 'completed'
GROUP BY 1

Changing the transformation means editing the SQL file and running the model. No Dataflow pipeline to rebuild and redeploy.

Tip

Start with ELT for every new BigQuery analytics pipeline. Load raw data first and iterate on SQL transformations. You can always add a Dataflow ETL step later for specific cases that require streaming or pre-load processing. Going the other direction (rebuilding ETL pipelines as ELT) is harder because you may not have preserved the raw data.

When to use ETL

  • Streaming ingestion with real-time requirements. An IoT platform publishes sensor readings to Pub/Sub every second. Dataflow consumes the stream, applies windowed aggregations (5-minute averages), and writes results to BigQuery. The transformation must happen at ingestion time because downstream consumers need near-real-time aggregated data, not raw events.
  • Pre-storage PII masking. A healthcare application processes patient records. Regulatory requirements prohibit storing unmasked patient identifiers in the data warehouse. Dataflow hashes or redacts sensitive fields before writing to BigQuery, so raw PII never exists in warehouse storage.
  • Binary or proprietary format parsing. A legacy system exports data in a custom binary protocol. BigQuery cannot parse it. A Dataflow pipeline reads the binary files, deserialises them into structured records, and loads the parsed output.
  • High-throughput stream enrichment. An ad-tech pipeline joins clickstream events arriving at 100K events/second with a reference dataset (campaign metadata). This join needs to happen at ingestion speed. Dataflow side inputs handle it efficiently, while a post-load BigQuery join would introduce latency and repeated full-table scans.
Warning

Do not default to Dataflow just because a pipeline “feels complex.” If the data is already in BigQuery and the transformation is expressible in SQL, a Dataflow pipeline adds operational overhead with no benefit. Complexity in business logic does not automatically mean you need a processing engine.

When to use ELT

  • Batch analytics on structured or semi-structured data. A retail company loads daily sales CSV exports into BigQuery and builds dimensional models (fact tables, dimension tables) with Dataform. Analysts iterate on business logic in SQL without waiting for engineering pipeline changes.
  • SaaS data consolidation. Marketing data from Google Ads, Facebook Ads, and Salesforce lands in BigQuery via the Data Transfer Service and third-party connectors. dbt models normalise the different source schemas into a unified reporting layer.
  • Ad-hoc exploration and prototyping. A data scientist loads a new dataset into BigQuery to evaluate whether it is useful. Raw data goes in first; transformation happens if and when the data proves valuable. No upfront pipeline investment.
  • Evolving requirements. The business changes how it defines “active user” quarterly. With ELT, the raw event data stays in BigQuery. The team updates the SQL model definition and reruns it. No re-extraction from the source system is needed.
Tip

ELT shines when requirements are unclear or changing. Because raw data is preserved in BigQuery, you can always add new transformation models later without going back to the source system. This makes ELT the safer default when you are still learning what questions the business will ask.

When to use both together

Many production pipelines use ETL and ELT in the same data flow. The boundary is the BigQuery raw landing zone: Dataflow handles everything upstream, and Dataform or dbt handles everything downstream.

Hybrid example: e-commerce event pipeline

Mobile app → Pub/Sub → Dataflow (parse JSON, mask email, validate schema)

                          BigQuery raw.events table

                     Dataform (sessionise, build fact tables, aggregate)

                          BigQuery marts.daily_metrics

The ETL layer (Dataflow) handles streaming ingestion, PII masking, and schema validation. These are things that must happen before data reaches storage.

The ELT layer (Dataform) handles sessionisation, dimensional modelling, and metric aggregation. These benefit from iterative SQL development and do not require real-time processing.

Analogy

The hybrid pattern works like a food supply chain. The ETL layer is the processing plant: it inspects, cleans, and packages raw produce before it reaches the warehouse shelves. The ELT layer is the store kitchen: it takes packaged ingredients off the shelf and turns them into meals on demand. Each layer does what it is best at, and the warehouse shelf (BigQuery raw table) is the clear handoff point.

The key decision: if a transformation must happen before storage (compliance, format, latency), it belongs in the ETL layer. If it can happen after storage, it belongs in the ELT layer where iteration is faster and cheaper.

Dataflow vs Dataform/dbt: which tool for which job

The real decision most teams face is not “ETL vs ELT” in the abstract. It is “should I write this transformation in a Dataflow pipeline or in a SQL model?” Here is how to decide:

Dataflow (Apache Beam)Dataform / dbt (BigQuery SQL)
Processing modelStreaming and batchBatch only
Transformation timingBefore data reaches BigQueryAfter data is in BigQuery
LanguagePython or Java (Beam SDK)SQL
Who can maintain itData engineersData engineers, analytics engineers, analysts
Iteration speedSlow: code change → build → deploy → validateFast: edit SQL → run model → check results
Operational overheadManage Dataflow jobs, worker scaling, pipeline monitoringMinimal: runs as BigQuery queries
Dependency managementCustom orchestration or Cloud Composer DAGsBuilt-in (Dataform refs, dbt refs)
TestingUnit tests in Beam, integration tests against test datasetsSQL assertions, schema tests, row-count checks
Best forStreaming, pre-load processing, complex event-driven logicBatch modelling, dimensional design, metric definitions

Rule of thumb: if the data is already in BigQuery and the logic can be expressed in SQL, use Dataform or dbt. Reach for Dataflow only when you need streaming, pre-load processing, or logic that SQL cannot express (custom Python/Java functions on individual records at scale).

Note

Dataform and dbt are not competing with Dataflow. They solve different problems. Dataflow moves and pre-processes data. Dataform and dbt model data that has already landed. Comparing them directly is like comparing a delivery truck to a kitchen. Both are part of getting dinner on the table, but they do different jobs.

Common beginner mistakes

  1. Choosing Dataflow when SQL in BigQuery would be simpler. Dataflow is powerful but operationally heavy. For batch transformations on data already sitting in BigQuery, a SQL model in Dataform or dbt is simpler, faster to iterate on, and cheaper to run. Save Dataflow for cases that genuinely require streaming or pre-load processing.
  2. Loading raw data with no staging or modelling plan. ELT only works if the T actually happens. Landing raw data in BigQuery and never building transformation models leaves analysts with unusable tables and creates a data swamp. Define your staging and mart layers before the first load.
  3. Treating dbt as the only ELT option in GCP. Dataform is Google’s native SQL transformation tool, integrated directly into the BigQuery console. Scheduled queries handle simple standalone transformations. Evaluate all three options (dbt, Dataform, and scheduled queries) before committing to one.
  4. Storing sensitive raw data without governance thinking. ELT means raw data lands in BigQuery before transformation. If that raw data contains PII, payment details, or regulated fields, you need column-level access controls, data masking policies, or a Dataflow ETL step to redact sensitive fields before loading. “Load everything raw” is not a governance strategy.
  5. Mixing ETL and ELT responsibilities with no clear boundary. When a pipeline has both a Dataflow step and a dbt/Dataform step, confusion arises about which layer owns which transformation. Set a clear rule: Dataflow handles ingestion, format parsing, and compliance requirements. The SQL layer handles business logic, modelling, and aggregation. The BigQuery raw table is the handoff point.
  6. Transforming data before understanding the requirements. With ETL, once you transform and discard raw data, you cannot go back without re-extracting from the source. ELT avoids this by preserving raw data. If you are unsure about final requirements, default to ELT so you can iterate on transformation logic later.

Frequently asked questions

What is the difference between ETL and ELT in GCP?

ETL (Extract, Transform, Load) transforms data in a processing engine like Dataflow before loading it into the destination. ELT (Extract, Load, Transform) loads raw data into BigQuery first, then transforms it using SQL with tools like Dataform or dbt. ELT is the default for most GCP analytics work because BigQuery handles large SQL transformations efficiently and preserves raw data for reprocessing.

When should I use Dataflow instead of BigQuery SQL transformations?

Use Dataflow when you need real-time streaming from Pub/Sub with windowing and aggregation, when you must mask or redact sensitive fields before any storage, when the raw data format (binary, encrypted, proprietary) cannot be loaded into BigQuery directly, or when you need to join streaming events with reference data at high throughput. For batch transformations on data already in BigQuery, SQL transformations with Dataform or dbt are simpler and cheaper.

Can ETL and ELT be used together in one pipeline?

Yes. A common hybrid pattern uses Dataflow as the ETL layer to ingest streaming data from Pub/Sub, apply light transformations like parsing or PII masking, and land the results in BigQuery raw tables. Dataform or dbt then runs as the ELT layer, transforming those raw tables into analytical models. Dataflow owns everything upstream of BigQuery; the SQL transformation tool owns everything downstream.

Is dbt or Dataform better for ELT in GCP?

Dataform is the Google-native option, integrated directly into BigQuery console with no separate infrastructure to manage. dbt is the cross-platform standard with a larger ecosystem of packages, community support, and cloud-agnostic portability. Teams already invested in GCP with no multi-cloud requirement often start with Dataform. Teams that need portability or use dbt elsewhere tend to choose dbt. Both produce SQL transformations that run inside BigQuery.

When should raw data not be loaded into BigQuery?

Do not load raw data directly into BigQuery when it contains PII or sensitive fields that must be masked before any persistence, when the data format is binary or proprietary and cannot be parsed by BigQuery, when regulatory requirements prohibit storing certain data categories in a data warehouse, or when the raw volume is extremely large and only a filtered subset is analytically useful. In those cases, filter or transform with Dataflow first.

Last verified: 26 March 2026 Cloud services change frequently. Verify details against official documentation before making infrastructure decisions.