Data Lake Architecture in GCP Explained: Zones, BigQuery, BigLake
A data lake architecture in GCP stores raw data in Cloud Storage and organises it into zones so teams can ingest first and model later. BigQuery, BigLake, and Dataplex add query power and governance on top. The result is a flexible, cost-effective foundation for analytics, machine learning, and data pipelines.
Simple explanation
Think of a data lake as a central place where you dump all your raw data, no matter the format: CSV exports, JSON logs, database backups, event streams, images. You do not need to clean it or define a schema before storing it. The data sits in Cloud Storage, which is cheap and scales without limits.
Once the data is there, you organise it into zones. Raw data stays untouched in one zone. Cleaned, validated data moves to a second zone. Analytics-ready data goes to a third. When you need answers, you query it with BigQuery or BigLake SQL, pulling from whichever zone has the data you need.
Store everything, transform what you need, query when ready. The schema is applied at read time, not at write time. This is the opposite of a traditional data warehouse, where you must define the schema before loading anything.
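To make schema-on-read concrete, here is a minimal local sketch using only standard shell tools. The file contents and function names are hypothetical; the point is that one untouched raw file serves two consumers, each of which decides how to interpret the columns only when it reads them.

```shell
# Hypothetical raw file, stored exactly as the source produced it.
raw_file="$(mktemp)"
cat > "$raw_file" <<'EOF'
order_id,amount,country
1001,19.99,GB
1002,5.00,FR
1003,42.50,GB
EOF

# Reader 1: interprets column 2 as a number and sums it.
total_revenue() {
  awk -F, 'NR > 1 { sum += $2 } END { printf "%.2f\n", sum }' "$1"
}

# Reader 2: ignores amounts and counts orders for one country code.
orders_in_country() {
  awk -F, -v wanted="$2" 'NR > 1 && $3 == wanted { n++ } END { print n + 0 }' "$1"
}

total_revenue "$raw_file"
orders_in_country "$raw_file" GB
```

Neither reader changes the file, so a third consumer with a different interpretation can be added later without reloading anything.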
Why data lake architecture matters
Traditional data warehouses require you to define the schema before loading data. That works well for structured, well-understood datasets. But when data arrives in many formats, from many sources, at unpredictable volumes, forcing it into a schema upfront slows teams down and discards information.
A data lake solves this by separating storage from processing:
- Flexible ingestion. Accept structured tables, semi-structured JSON, unstructured logs, and binary files without redesigning your schema.
- Low storage cost. Cloud Storage is significantly cheaper than warehouse storage for large volumes of raw data, especially with Nearline or Coldline classes for older data.
- Analytics when ready. Apply different schemas to the same raw data for different use cases: BI dashboards, data science notebooks, ML training pipelines.
- Reprocessing safety. Keeping raw data immutable means you can always reprocess from source if downstream transformations produce incorrect results.
- Support for multiple consumers. Data engineers, analysts, and data scientists all access the same underlying data through different tools and patterns.
For most GCP teams, a data lake works alongside BigQuery rather than replacing it. Raw data stays in Cloud Storage; clean, aggregated data serves analysts from BigQuery. The two patterns are complementary.
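The storage-class point above can be sketched as a Cloud Storage lifecycle configuration. This is an illustration under assumptions: the bucket name gs://my-lake-raw is hypothetical, and the 30-day and 90-day thresholds are placeholders rather than recommendations. The gcloud command is printed instead of executed so the snippet is safe to paste.

```shell
# Write a lifecycle configuration to a local file (thresholds are illustrative).
lifecycle_file="${TMPDIR:-/tmp}/lake-lifecycle.json"
cat > "$lifecycle_file" <<'EOF'
{
  "rule": [
    {
      "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
      "condition": {"age": 30, "matchesStorageClass": ["STANDARD"]}
    },
    {
      "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
      "condition": {"age": 90, "matchesStorageClass": ["NEARLINE"]}
    }
  ]
}
EOF

# Printed, not executed. Remove the leading "echo" to apply it to a real bucket.
echo gcloud storage buckets update gs://my-lake-raw --lifecycle-file="$lifecycle_file"
```

Colder classes carry retrieval charges and minimum storage durations, so it is worth checking how often older data is actually read before choosing thresholds.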
How a data lake architecture works in GCP
A typical GCP data lake follows this flow:
- Data sources. Application databases, SaaS APIs, IoT devices, clickstream events, log files, third-party exports. Data arrives in whatever format the source produces.
- Ingestion. Pub/Sub handles real-time event streams. Dataflow runs batch and streaming ingestion pipelines. Cloud Storage Transfer Service pulls from external sources. gsutil or gcloud storage cp handles ad-hoc uploads.
- Raw zone (Cloud Storage). Data lands exactly as received. No transformations. Append-only. This is your safety net for reprocessing.
- Curated zone (Cloud Storage). Dataflow or Dataproc jobs clean, validate, deduplicate, and convert data to columnar formats like Parquet. Data engineers own this zone.
- Consumption zone (Cloud Storage or BigQuery). Aggregated, business-ready data optimised for read performance. Often loaded into BigQuery managed tables, or kept as Parquet files queried via BigLake.
- Query and governance layer. BigQuery and BigLake provide SQL access across all zones. Dataplex catalogues assets, tracks lineage, and enforces data quality rules.
A data lake is like a water treatment system. The river (raw zone) carries everything: clean water, debris, sediment. Filtration (curated zone) removes impurities and standardises the water. The reservoir (consumption zone) holds clean water ready for users. You always keep the river water so you can re-treat it if the filtration process changes.
Core zones in a GCP data lake
Well-governed data lakes organise data into zones based on how processed and trusted the data is. Each zone maps to a separate Cloud Storage bucket (or prefix) with its own IAM policies.
Raw zone
Contains data exactly as it arrived from source systems. No transformations. This is your reprocessing safety net. Raw data must be immutable once landed. Never overwrite or delete raw data. Use Cloud Storage object versioning or retention policies to enforce this.
Treat raw zone data like a backup tape. Once written, it should never be modified or deleted. If a pipeline produces bad results six months from now, this untouched copy is the only thing that lets you reprocess from scratch.
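One way to enforce that rule is a bucket retention policy. The sketch below assumes a hypothetical bucket name and a one-year period expressed in seconds; both are placeholders. The command is printed rather than executed.

```shell
# Hypothetical bucket name and retention period (365 days in seconds).
RAW_BUCKET="gs://my-lake-raw"
RETENTION="31536000s"

protect_raw_zone() {
  # $1 = bucket, $2 = retention period.
  # A retention policy blocks deletes and overwrites until each object
  # has been stored for at least the given period.
  echo gcloud storage buckets update "$1" --retention-period="$2"
  # Alternative: keep noncurrent versions instead of blocking writes.
  # echo gcloud storage buckets update "$1" --versioning
}

# Printed, not executed. Remove "echo" inside the function to apply it.
protect_raw_zone "$RAW_BUCKET" "$RETENTION"
```

A retention policy prevents the mistake outright, while versioning lets the overwrite happen and keeps the previous copy recoverable. Which one fits depends on whether the raw zone ever has a legitimate reason to replace an object.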
Curated zone
Contains data that has been cleaned, validated, deduplicated, and standardised. Sensitive fields may be masked. Data is stored in columnar formats (Parquet or ORC) so BigQuery can query it efficiently via external tables. Data engineers own this zone.
Consumption zone
Holds aggregated, business-ready data that analysts and data scientists query directly. Data here often lives in BigQuery managed tables or as Parquet files queried via BigLake. Optimised for read performance, not ingestion flexibility.
| Zone | What it stores | Typical format | Who uses it | Main rule |
|---|---|---|---|---|
| Raw | Untouched source data | CSV, JSON, Avro, any | Data engineers | Append-only, never modify |
| Curated | Cleaned, validated data | Parquet, ORC | Data engineers, analysts | Schema-enforced, deduplicated |
| Consumption | Aggregated, business-ready data | Parquet, BigQuery tables | Analysts, data scientists | Optimised for fast queries |
Example architecture
Here is a concrete example for an ecommerce platform:
- Sources: Clickstream events from the website, order exports from the database, application logs from Cloud Logging, marketing campaign CSVs from a third-party vendor.
- Ingestion: Clickstream events flow through Pub/Sub into the raw zone. Order exports arrive via a nightly Dataflow batch job. Logs are exported from Cloud Logging. Campaign CSVs are uploaded manually.
- Raw zone: All data lands in gs://ecom-lake-raw/ in its original format, partitioned by date.
- Curated zone: A Dataflow pipeline reads raw clickstream JSON and order CSVs, deduplicates records, enforces data types, masks customer PII, and writes Parquet to gs://ecom-lake-curated/.
- Consumption zone: Aggregated daily revenue, funnel conversion rates, and customer segment tables load into BigQuery. Infrequently queried historical data stays as Parquet in gs://ecom-lake-consumption/ and is queried via BigLake.
# Create zone buckets with uniform bucket-level access
gcloud storage buckets create gs://ecom-lake-raw \
--location=europe-west2 --uniform-bucket-level-access
gcloud storage buckets create gs://ecom-lake-curated \
--location=europe-west2 --uniform-bucket-level-access
gcloud storage buckets create gs://ecom-lake-consumption \
--location=europe-west2 --uniform-bucket-level-access
A data lake without governance becomes a data swamp: unlabelled, undocumented, untrusted data that no one can use confidently. Set up Dataplex to catalogue your data assets and enforce quality rules from day one, not after the lake has grown uncontrolled.
File formats and partitioning
The format you choose directly affects query speed, storage cost, and tooling compatibility.
CSV
CSV is acceptable in the raw zone for preserving original source data. It is human-readable and universally supported. However, CSV is row-based with no type information, no compression by default, and no column pruning. Do not use CSV in the curated or consumption zones.
Parquet
Parquet is the standard format for analytical data in GCP data lakes. It is columnar: queries reading only a subset of columns skip unneeded data entirely. Parquet compresses well and includes type metadata. BigQuery queries Parquet via external tables or BigLake efficiently because it pushes column pruning and filter predicates down to the file scan. Use Parquet in the curated and consumption zones.
CSV is like a printed spreadsheet where you must read every row to find what you want. Parquet is like a filing cabinet organised by column: if you only need the “revenue” column, you pull that drawer and ignore everything else. That is why Parquet queries are faster and cheaper.
Avro
Avro is row-based with strong schema evolution support. It is well suited for event data and Pub/Sub message schemas where the schema changes over time. Use Avro for streaming and event data in the raw zone when schema compatibility matters.
Hive-style partitioning
Organise files using Hive-style directory structure that mirrors your query patterns. For time-series data:
gs://ecom-lake-curated/
  orders/
    year=2026/month=03/day=10/
      orders-001.parquet
    year=2026/month=03/day=11/
      orders-001.parquet
When you create a BigQuery external table over this path with partition detection enabled, BigQuery prunes partitions during queries. A filter on year = 2026 AND month = 3 AND day = 10 scans only that day’s directory, not the entire bucket. Without partitioning, BigQuery scans all files, which increases both cost and query time.
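As a sketch of what "partition detection enabled" can look like, the snippet below writes a BigQuery DDL statement for an external table over the orders path. The dataset name lake and table name orders_ext are hypothetical. WITH PARTITION COLUMNS with no column list asks BigQuery to infer the keys from the path. The bq command is printed rather than executed.

```shell
# Hypothetical dataset and table names; the bucket path follows the layout above.
ddl_file="${TMPDIR:-/tmp}/orders_ext.sql"
cat > "$ddl_file" <<'EOF'
CREATE EXTERNAL TABLE IF NOT EXISTS lake.orders_ext
WITH PARTITION COLUMNS
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://ecom-lake-curated/orders/*'],
  hive_partition_uri_prefix = 'gs://ecom-lake-curated/orders'
);
EOF

# Printed, not executed. Run the printed line yourself once the dataset exists.
echo "bq query --use_legacy_sql=false < $ddl_file"
```

Once the table exists, year, month, and day behave like ordinary columns in a WHERE clause, and filters on them limit which directories are read.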
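Use Hive-style partition keys (year=2026/month=03/day=10/) rather than plain date paths (2026/03/10/). BigQuery can infer partition columns from Hive-style keys once Hive partitioning is enabled on the external table. Plain date paths carry no column names, so they cannot be detected or pruned this way.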
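Pipelines that write into the curated zone need to produce these paths consistently. A minimal helper might look like the following; the function name is hypothetical and the bucket is the example curated bucket.

```shell
# Build a Hive-style object prefix from a dataset name and an ISO date.
partition_prefix() {
  dataset="$1"
  iso_date="$2"                # expected form: YYYY-MM-DD
  year="${iso_date%%-*}"       # text before the first "-"
  rest="${iso_date#*-}"        # text after the first "-"
  month="${rest%%-*}"
  day="${rest#*-}"
  printf 'gs://ecom-lake-curated/%s/year=%s/month=%s/day=%s/\n' \
    "$dataset" "$year" "$month" "$day"
}

partition_prefix orders 2026-03-10
```

Keeping path construction in one function avoids the drift between year=2026/month=3 and year=2026/month=03 that tends to appear when each job formats dates on its own.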
When to use a data lake architecture
A data lake pattern is a good fit when:
- You have large volumes of raw data in multiple formats (JSON, CSV, Avro, logs, images) that you want to store before deciding how to use it.
- You need cheap storage first and modelling later. Cloud Storage costs a fraction of warehouse storage for rarely queried data.
- Multiple teams need different views of the same raw data: BI dashboards, data science exploration, ML training datasets.
- You want to keep a historical, immutable copy of all raw data for reprocessing if business logic changes.
- Your data pipelines are ELT-style: load first, transform later.
You do not need every GCP service on day one. Start with Cloud Storage buckets for your zones, a Dataflow job to move data between them, and BigQuery external tables to query the results. Add Dataplex and BigLake as your lake grows.
When not to use a data lake architecture
A data lake adds complexity that is not always justified:
- Small structured datasets. If your data fits cleanly into BigQuery tables and is always structured, a lake layer adds overhead with no real benefit. Load directly into BigQuery.
- Fully modelled data requirements. If every consumer needs the same clean schema immediately, a warehouse-first approach is simpler. Skip the raw zone.
- No multi-format retention needs. If you only work with structured tables and never need to reprocess from raw, Cloud Storage adds a layer you do not use.
- Small team, limited data engineering capacity. A data lake needs ongoing governance, pipeline maintenance, and format management. If you do not have the team to support it, start with BigQuery and add a lake layer when the need arises.
Data lake vs data warehouse vs lakehouse
These three patterns serve different needs. In GCP, you often combine them. For a deeper comparison, see Data Warehouses vs Data Lakes.
| Pattern | Storage | Schema | Best for | GCP services |
|---|---|---|---|---|
| Data lake | Cloud Storage | Schema-on-read | Raw multi-format data, cheap retention, ML exploration | Cloud Storage, Dataflow, Dataplex |
| Data warehouse | BigQuery managed storage | Schema-on-write | Structured analytics, BI dashboards, frequent SQL queries | BigQuery |
| Lakehouse | Cloud Storage + BigQuery query layer | Schema-on-read with governance | Combining lake flexibility with warehouse-style queries | BigLake, BigQuery external tables, Dataplex |
Most production GCP platforms use a hybrid. Raw data lands in Cloud Storage (lake), frequently queried clean data lives in BigQuery (warehouse), and BigLake bridges the two by letting you query lake data with warehouse-level access control. You do not have to pick one pattern.
Common beginner mistakes
- Mixing raw and transformed data in the same bucket. Keep zones strictly separated with distinct buckets and IAM policies. If raw and curated data intermix, you cannot tell which data is trusted. Separate buckets enforce the boundary.
- Overwriting or deleting raw data. Raw zone data must be append-only. If you overwrite raw data, you lose the ability to reprocess from source. Use Cloud Storage retention policies or object versioning to protect raw zone data.
- Skipping governance. A data lake without a catalogue (Dataplex) becomes a swamp. No one knows what data exists, who owns it, when it was last updated, or whether it can be trusted. Catalogue assets from the start.
- Storing analytics data as CSV. CSV works in the raw zone for preserving original source data. But if your curated or consumption zone data is also CSV, BigQuery queries will be slower and more expensive than Parquet. Convert to columnar format in the curated layer.
- Poor naming and partition strategy. Inconsistent bucket paths, missing partition keys, and vague folder names make data impossible to discover and expensive to query. Establish naming conventions and Hive-style partitioning before the lake grows.
- Weak IAM boundaries between zones. If the same service account can write to raw, curated, and consumption zones, a bad pipeline run can corrupt trusted data. Use separate IAM roles per zone so raw-zone writers cannot modify curated data.
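The IAM boundary in the last point can be sketched with bucket-level role bindings. The service account names and PROJECT_ID are hypothetical, and the role choices are one reasonable split rather than the only one. Commands are printed, not executed.

```shell
# Hypothetical service accounts; replace PROJECT_ID with your project.
INGEST_SA="serviceAccount:lake-ingest@PROJECT_ID.iam.gserviceaccount.com"
CURATE_SA="serviceAccount:lake-curate@PROJECT_ID.iam.gserviceaccount.com"

grant() {
  # $1 = bucket, $2 = member, $3 = role. Printed, not executed.
  echo gcloud storage buckets add-iam-policy-binding "$1" --member="$2" --role="$3"
}

# Ingestion may only create objects in raw; objectCreator cannot delete,
# and therefore cannot overwrite, existing objects.
grant gs://ecom-lake-raw "$INGEST_SA" roles/storage.objectCreator
# The curation pipeline reads raw and manages objects in curated only.
grant gs://ecom-lake-raw "$CURATE_SA" roles/storage.objectViewer
grant gs://ecom-lake-curated "$CURATE_SA" roles/storage.objectUser
```

With this split, a faulty ingestion run cannot touch curated data, and a faulty curation run cannot alter raw data.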
Key takeaways
- A GCP data lake stores raw data in Cloud Storage with schema applied at query time, not at ingestion.
- Three zones keep data organised: raw (untouched, immutable), curated (cleaned, columnar), consumption (aggregated, analytics-ready). Use separate buckets per zone.
- BigQuery queries lake data through external tables or BigLake. BigLake adds row- and column-level access control to Cloud Storage data.
- Dataplex provides governance: cataloguing, lineage tracking, and data quality enforcement. Set it up from the start.
- Use Parquet in curated and consumption zones. Use Avro for streaming data. Keep raw zone files in their original format.
- Hive-style partition keys enable automatic partition pruning in BigQuery, reducing scan cost and query time.
- Most GCP teams combine a data lake (Cloud Storage) with a data warehouse (BigQuery). The two patterns are complementary, not competing.
Frequently asked questions
What is a data lake architecture in GCP?
A data lake architecture in GCP uses Cloud Storage as the primary storage layer for raw, semi-structured, and unstructured data. Data is organised into zones (raw, curated, consumption), governed with Dataplex, and queried through BigQuery or BigLake. The schema is applied at query time, not at ingestion, so teams can store data before deciding exactly how to use it.
When should I use a data lake instead of BigQuery tables only?
Use a data lake when you need to store large volumes of multi-format data cheaply before deciding how to model it. BigQuery managed storage is ideal for frequently queried, structured data. A data lake in Cloud Storage is better for raw logs, event streams, images, and data you query occasionally. Most teams use both together.
What is the difference between a data lake and a data warehouse?
A data lake stores raw data in any format without a predefined schema (schema-on-read). A data warehouse like BigQuery stores structured, cleaned data with a fixed schema (schema-on-write). A lakehouse pattern combines both: raw data in Cloud Storage queried through BigQuery via BigLake or external tables.
What does BigLake do in a data lake architecture?
BigLake lets you query data stored in Cloud Storage using BigQuery SQL, with fine-grained access control at the row and column level. It bridges the gap between a data lake and a warehouse by applying warehouse-style governance to lake storage. You get the flexibility of Cloud Storage with the query power of BigQuery.
Why is Dataplex useful in a GCP data lake?
Dataplex provides data governance for your lake: cataloguing assets, tracking data lineage, enforcing quality rules, and managing metadata across Cloud Storage and BigQuery. Without governance, a data lake becomes a data swamp of unlabelled, undocumented data that no one trusts or can use effectively.