How to Load Data into BigQuery: Methods and Best Practices
Loading data into BigQuery means getting your data into BigQuery’s managed columnar storage so you can query it with SQL. There are several ways to do this: batch loads from files, real-time streaming, scheduled transfers from SaaS platforms, continuous database replication, and even querying external data without loading it at all. Choosing the right method determines your cost, data freshness, and pipeline complexity.
How to load data into BigQuery
Most teams start with batch loads from Cloud Storage because they are free, simple, and handle the majority of analytics workloads. When seconds of freshness matter, the Storage Write API provides real-time streaming. The Data Transfer Service automates imports from Google Ads, GA4, and other platforms. Datastream handles CDC replication from operational databases. And for small reference datasets, external tables let you query data in place without loading it.
The cost difference between these methods is significant. A batch load from Cloud Storage costs nothing for the load job itself. Streaming the same data can cost 50 times more per GB. Understanding these trade-offs before you build a pipeline prevents expensive mistakes.
The simple explanation
There are three fundamentally different things people mean when they say “getting data into BigQuery”:
Loading data copies your files or records into BigQuery’s own storage. Once loaded, the data lives inside BigQuery, benefits from columnar storage and compression, and is fast to query. Batch loads and streaming both do this. They differ in speed and cost.
Transferring data is a scheduled, managed version of loading. Services like BigQuery Data Transfer Service and Datastream automate the pipeline so you do not write custom code to pull data from external sources.
Querying external data does not load anything. BigQuery reads from an external source (a Cloud Storage bucket, a Google Sheet) at query time. The data stays where it is. This is convenient for small reference tables but slower and more expensive to query than native BigQuery tables.
Loading data is like moving books onto your bookshelf. They are fast to find and well-organised. Querying external data is like reading a book in the shop without buying it: convenient for a quick look, but you would not do it every day.
How loading data into BigQuery works
BigQuery provides five core paths for getting data in. Each serves a different use case.
Batch load from local files or Cloud Storage. You upload CSV, JSON, Avro, Parquet, or ORC files to a Cloud Storage bucket (or load directly from your local machine) and submit a load job. BigQuery reads the files, converts the data to its internal Capacitor columnar format, and writes it to the destination table. The load job itself is free. You pay only for storage.
Streaming with the Storage Write API. Applications send individual rows or small batches to BigQuery via the Storage Write API. Data becomes queryable within seconds. This is the path for real-time dashboards, fraud detection, and any scenario where seconds of freshness genuinely matter.
Scheduled imports with BigQuery Data Transfer Service. Pre-built connectors pull data from Google Ads, GA4, YouTube Analytics, Campaign Manager, Amazon S3, and other sources on a schedule you configure. Data Transfer Service also supports scheduled queries and recurring loads from Cloud Storage.
CDC and near-real-time replication. Datastream captures row-level changes (inserts, updates, deletes) from operational databases like MySQL, PostgreSQL, AlloyDB, and Oracle, and replicates them into BigQuery continuously. This keeps your analytics tables in sync with your production database without building custom data pipelines.
External tables and Google Sheets. BigQuery defines a table that points to data in Cloud Storage, Google Sheets, or Bigtable. Queries read from the external source at query time. No data is copied into BigQuery managed storage, which means no columnar optimisation, no clustering benefit, and slower performance. This is not loading data. It is a query-without-loading option for small or infrequently accessed datasets.
All five paths above are valid ways to work with data in BigQuery, but only the first three actually move data into BigQuery’s managed storage. External tables and Google Sheets leave data in place and read it on the fly.
Choose the right method
Use this comparison to pick the right ingestion path for your workload.
| Method | Best for | Data freshness | Cost pattern | Complexity | Typical tools |
|---|---|---|---|---|---|
| Batch load (Cloud Storage) | Regular ETL, historical backfills | Minutes to hours | Free (load job) | Low | bq load, LOAD DATA SQL, Cloud Composer |
| Storage Write API | Real-time dashboards, fraud detection | Seconds | ~$0.025/GB | Medium to high | Dataflow, custom app with client library |
| Data Transfer Service | SaaS imports, scheduled Cloud Storage loads | Hours (scheduled) | Free (service); source costs vary | Low | Console, bq mk —transfer_config |
| Datastream (CDC) | Database replication to BigQuery | Seconds to minutes | Per-GB processed | Medium | Datastream console, Terraform |
| External tables / Sheets | Small reference data, ad-hoc exploration | Live (reads at query time) | Query cost only (no load) | Low | Console, CREATE EXTERNAL TABLE |
When in doubt, start with a batch load from Cloud Storage. It is free, simple, and covers the majority of analytics use cases. Move to streaming or CDC only when you have a confirmed business requirement for fresher data.
Batch loads from local files or Cloud Storage
Batch loading is the most common and cost-efficient way to get data into BigQuery. You upload data files to a Cloud Storage bucket, submit a load job, and BigQuery writes the data to a table. The load job carries no compute charge. You pay only for storage of the source files and the destination data.
Supported formats: CSV, newline-delimited JSON, Avro, Parquet, and ORC. Avro and Parquet are strongly preferred for large loads. They are binary formats with embedded schemas, which makes them faster to load, more reliable to parse, and better compressed than CSV. If you are choosing a format for a new pipeline, default to Parquet.
When to use batch loads
Use batch loads for regular ETL pipelines, historical backfills, nightly or hourly data refreshes, and any workload where data freshness of minutes to hours is acceptable. Most analytics teams run batch loads as their primary ingestion method.
When not to use batch loads
Do not use batch loads when you need data queryable within seconds. Batch load jobs take time to queue and execute, typically seconds to minutes for small files and longer for large ones. If your use case requires sub-minute freshness, use the Storage Write API instead.
Example: bq load from Cloud Storage
# Load a Parquet file from Cloud Storage into a partitioned table
bq load \
--source_format=PARQUET \
--time_partitioning_field=event_timestamp \
my_project:analytics.events \
gs://my-project-data/events/2026-03-25/*.parquetYou can also use the LOAD DATA SQL statement, which is useful when you want to define the load as part of a scheduled query or a SQL-based workflow.
-- Load CSV data from Cloud Storage using SQL
LOAD DATA OVERWRITE analytics.daily_sales
FROM FILES (
format = 'CSV',
uris = ['gs://my-project-data/sales/2026-03-25/*.csv']
);Use wildcards to load multiple files in a single job: gs://my-bucket/events/2026-03-*.parquet. BigQuery processes all matching files in parallel, making multi-file loads faster than loading sequentially.
Streaming data with the Storage Write API
The Storage Write API lets applications send rows to BigQuery and make them queryable within seconds. It supports exactly-once semantics, higher throughput, and lower cost than the older streaming insert API. All new streaming pipelines should use the Storage Write API.
Streaming costs roughly $0.025 per GB via the Storage Write API, and $0.05 per GB with the older insert API. A batch load from Cloud Storage costs nothing. Do not stream data when a batch load would meet your freshness requirements. The cost difference compounds quickly at scale.
When to use streaming
Use streaming when your business genuinely needs near-real-time data: a live dashboard showing the last five minutes of activity, fraud detection on individual transactions, real-time bidding, or operational alerting. The standard GCP pattern routes events through Pub/Sub, transforms them in Dataflow, and writes to BigQuery using the Storage Write API.
When not to use streaming
Do not use streaming for nightly or hourly batch jobs. If your stakeholders check dashboards once a day, batch loading is entirely sufficient and saves significant cost.
Architecture example
A typical real-time ingestion flow:
Application events → Pub/Sub → Dataflow (Apache Beam) → Storage Write API → BigQuery
Pub/Sub decouples producers from consumers and handles backpressure. Dataflow provides exactly-once processing, windowing, and late data handling. The Storage Write API writes to BigQuery with exactly-once guarantees.
Batch loading is like receiving post once a day. Streaming is like having a phone line open continuously. The phone line costs more to maintain, but you hear things the moment they happen. Most people do not need a live phone call for something that a daily letter handles perfectly well.
BigQuery Data Transfer Service
The Data Transfer Service provides pre-built connectors that automate recurring data imports into BigQuery. It handles scheduling, retries, and error reporting without custom pipeline code.
Supported sources
Data Transfer Service supports Google SaaS products (Google Ads, GA4, YouTube Analytics, Campaign Manager, Search Ads 360), cross-cloud transfers from Amazon S3 and Azure Blob Storage, scheduled loads from Cloud Storage, and scheduled queries that materialise results into destination tables. Third-party connectors are also available through partner integrations.
When to use Data Transfer Service
Use it when your data lives in a supported platform and you want regular, scheduled imports without writing custom integration code. It is particularly strong for marketing analytics use cases where you need Google Ads, GA4, or YouTube data in BigQuery on a daily schedule.
When not to use Data Transfer Service
Do not use it for real-time ingestion. Transfers run on a schedule, not continuously. It is also not suitable for sources that require custom transformation logic before loading. For those cases, build a pipeline with Dataflow or Cloud Composer.
Data Transfer Service is not limited to SaaS sources. It also supports scheduled loads from Cloud Storage and scheduled queries. If you have a recurring LOAD DATA or CREATE TABLE AS SELECT pattern, consider wrapping it in a Data Transfer configuration rather than building a Cloud Composer DAG.
CDC and near-real-time replication to BigQuery
Change data capture (CDC) replicates row-level changes from an operational database into BigQuery continuously. When a row is inserted, updated, or deleted in the source database, the change appears in BigQuery within seconds to minutes.
Datastream is GCP’s managed CDC service. It supports MySQL, PostgreSQL, AlloyDB, and Oracle as sources and writes directly to BigQuery. Datastream reads the database’s change log (binlog for MySQL, logical replication for PostgreSQL) and applies changes to BigQuery tables, keeping them in sync.
CDC is like a live transcription service for your database. Every time someone speaks (writes a row), the transcript (BigQuery table) updates within seconds. You do not need to wait for someone to summarise the whole conversation at the end of the day.
When to use CDC replication
Use Datastream when you need your BigQuery analytics tables to stay closely in sync with a transactional database. For example, when analysts need to query yesterday’s orders alongside today’s and the data must reflect recent changes, not just a nightly snapshot.
When not to use CDC replication
If a daily batch export from your database is sufficient, a simpler approach is to export to Cloud Storage and batch load into BigQuery. CDC adds operational complexity (monitoring replication lag, handling schema changes) that is only justified when you need near-real-time freshness from a database source.
Datastream and Data Transfer Service solve different problems. Datastream captures live changes from operational databases. Data Transfer Service runs scheduled imports from SaaS platforms and Cloud Storage. They are not interchangeable.
Querying external data without loading it
BigQuery external tables point to data stored outside BigQuery: in Cloud Storage, Google Sheets, or Bigtable. When you query an external table, BigQuery reads from the external source at query time. No data is copied into BigQuery managed storage.
This is not the same as loading data. External tables do not benefit from BigQuery’s columnar storage, compression, or clustering. Query performance is slower, and you cannot use features like partition pruning on the external data itself (though you can define hive-partitioned external tables on Cloud Storage).
When to use external tables
Use external tables for small, infrequently queried reference data: a cost centre mapping in Google Sheets, a product category lookup, or a one-off exploration of files in Cloud Storage before deciding whether to load them. Google Sheets external tables have a 100 MB query limit.
When not to use external tables
Do not use external tables for large datasets, frequent queries, or performance-sensitive workloads. If you query the same external data regularly, load it into BigQuery managed storage instead. The query cost and latency savings will far outweigh the storage cost.
External tables re-read from the source on every single query. There is no caching, no columnar optimisation, and no clustering benefit. On anything beyond small reference data, query performance will be noticeably worse than a native BigQuery table.
Auto-detect vs explicit schema
The —autodetect flag samples a portion of your data and infers the schema. It is convenient for exploration and one-off loads, but it makes mistakes. A column containing dates formatted as strings might be inferred as STRING when you intended DATE. A sparse column with mostly nulls can be mistyped entirely.
For production pipelines, define the schema explicitly using a JSON schema file and pass it with —schema=schema.json. Explicit schemas guarantee type correctness and make schema changes a deliberate, reviewed step rather than a silent inference that may change with new data.
Auto-detection is like asking someone to guess a recipe’s ingredients by tasting a spoonful. Usually right, occasionally wrong, and not something you rely on when the dish must be consistent every time.
Loading into partitioned and clustered tables
If you load data into a partitioned table, BigQuery routes each row to the correct partition automatically based on the partition column’s value. You do not load partitions separately.
When loading into a clustered table, BigQuery sorts data within each partition by the clustering columns during the load job. Providing source data pre-sorted by the clustering columns can speed up the load, but it is not required.
The default write disposition is WRITE_APPEND. If you re-run a load job without intending to append, you get duplicate rows. Always set —replace or WRITE_TRUNCATE explicitly when your intent is to overwrite. See BigQuery Pricing Explained for the cost implications of duplicate data.
When to use each method
You have CSV or Parquet files arriving daily. Batch load from Cloud Storage. Free, simple, and the default for most analytics teams.
You need a live fraud detection dashboard. Stream via Pub/Sub → Dataflow → Storage Write API. Accept the streaming cost for genuine real-time need.
You want Google Ads data in BigQuery for marketing reports. Data Transfer Service. No code, scheduled imports, Google-defined schemas.
Your analysts need BigQuery tables that mirror your production PostgreSQL database. Datastream CDC replication. Changes appear in BigQuery within seconds to minutes.
An analyst manages a small lookup table in Google Sheets. External table. No loading needed, but only suitable for small, infrequently queried data.
You are exploring files in Cloud Storage before committing to a schema. External table on Cloud Storage, or a one-off batch load with —autodetect into a temporary dataset.
Common beginner mistakes
- Using streaming inserts for a nightly batch job. If your data arrives once per hour or once per day, batch loading from Cloud Storage is the right approach. Streaming costs roughly 50x more per GB with the legacy API. Reserve streaming for cases where seconds of latency genuinely matter to the business.
- Relying on auto-detect in production. Auto-detect infers types from a sample of rows. If edge cases appear later, such as a column with all nulls in the sample or dates in an ambiguous format, the inferred schema will be wrong. Define schemas explicitly for all production pipelines.
- Not specifying write disposition explicitly. The default is
WRITE_APPEND. If you re-run a load job without intending to append, you get duplicate rows. Always set—replaceorWRITE_TRUNCATEexplicitly when your intent is to overwrite. - Loading CSV without explicit delimiter and encoding flags. BigQuery defaults to UTF-8 and comma delimiters. A different encoding or a tab/pipe delimiter produces garbled data or failed loads. Always specify
—field_delimiterand—encodingfor CSV files. - Using external tables for frequent, large queries. External tables re-read from the source on every query. They lack columnar optimisation and partition pruning. If you query the same data regularly, load it into BigQuery managed storage instead.
- Treating Data Transfer Service as real-time. Data Transfer Service runs on a schedule (typically daily or hourly). It does not provide streaming or near-real-time freshness. If you need data within seconds, use the Storage Write API or Datastream instead.
Summary
- Batch loads from Cloud Storage are free and the default choice for regular ETL. Avro and Parquet are preferred formats over CSV.
- The Storage Write API provides streaming ingestion with exactly-once semantics at roughly $0.025 per GB. Use it only when seconds of freshness genuinely matter.
- Data Transfer Service automates scheduled imports from Google Ads, GA4, YouTube, Amazon S3, and other sources with no custom code.
- Datastream replicates changes from MySQL, PostgreSQL, AlloyDB, and Oracle into BigQuery in near-real time via CDC.
- External tables and Google Sheets let you query data without loading it, but performance is slower and there is no columnar optimisation. This is not the same as loading data into BigQuery.
- For production pipelines, define schemas explicitly. Use
—autodetectonly for exploration. - Default write disposition is
WRITE_APPEND. SetWRITE_TRUNCATEexplicitly when your intent is to overwrite. - When in doubt, start with batch loads. Move to streaming or CDC only when you have a confirmed requirement for fresher data.
Frequently asked questions
What is the cheapest way to load data into BigQuery?
Batch loading from Cloud Storage is free. You pay nothing for the load job itself, only for storage of the source files and the destination table. Upload your data to a GCS bucket in Avro or Parquet format and run a load job. This is the default choice for any pipeline where data freshness of minutes to hours is acceptable.
When should I use streaming instead of batch?
Use streaming only when your business genuinely needs data queryable within seconds: live fraud detection dashboards, real-time bidding, or operational alerts. If stakeholders check dashboards hourly or daily, batch loading is sufficient and far cheaper. Streaming via the Storage Write API costs roughly $0.025 per GB, compared to zero for batch loads.
Is Google Sheets really loading data into BigQuery?
No. When you query a Google Sheet as an external table, BigQuery reads from the sheet at query time. The data never enters BigQuery managed storage. Performance is slower, there is no columnar optimisation, and there is a 100 MB size limit per query. Use Sheets external tables only for small, infrequently queried reference data.
Should I use auto-detect schema in production?
No. The --autodetect flag infers types from a sample and can mistype ambiguous columns: dates stored as strings, sparse columns with mostly nulls, or integers that should be strings. For production pipelines, always define an explicit schema with a JSON schema file passed via --schema=schema.json.
What is the difference between Data Transfer Service and Datastream?
Data Transfer Service runs scheduled batch imports from SaaS platforms (Google Ads, GA4, YouTube) and also supports scheduled queries and Cloud Storage transfers. Datastream is a CDC (change data capture) service that replicates changes from operational databases like MySQL, PostgreSQL, AlloyDB, and Oracle into BigQuery in near-real time. Use Data Transfer Service for SaaS and scheduled imports; use Datastream for database replication.