How to Build a Cloud Data Pipeline Portfolio Project

A cloud data pipeline portfolio project sits at the intersection of cloud engineering and data engineering, which makes it useful for two audiences: cloud engineers at data-heavy companies, and data engineers who work on cloud platforms. This guide covers three tiers of pipeline complexity so you can choose the one that fits your current level and target role.

Who this project is for

A data pipeline project is most valuable if you are targeting:

  • Cloud data engineer roles (focused on GCP BigQuery, AWS Redshift, or Databricks infrastructure)
  • Analytics engineering roles where you need to demonstrate pipeline thinking alongside SQL skills
  • Cloud platform engineer roles at data companies (building the infrastructure data teams run on)
  • Generalist cloud engineer roles where the job description mentions data engineering tools

If you are targeting a pure infrastructure, DevOps, or SRE role at a company that does not do significant data work, a data pipeline project adds less value than a Kubernetes or CI/CD project. Choose the project type that fits your target role.

Tier 1: Batch data pipeline (beginner level)

What to build

Build a pipeline that extracts data from a public source, transforms it, and loads it into a cloud data warehouse. Use a publicly available dataset, such as government open data, a sports statistics API, or a public S3 dataset.

Architecture:

  • Extract: Python script pulling from a public API or downloading a CSV file
  • Storage: store raw data in an S3 bucket or GCS bucket (the data lake layer)
  • Transform: Python or SQL transformations on the raw data
  • Load: insert the cleaned data into BigQuery (GCP) or Redshift (AWS), or write it back to S3 and query it in place with Athena (AWS)
  • Orchestrate: schedule the pipeline to run daily using Cloud Scheduler + Cloud Functions (GCP) or EventBridge + Lambda (AWS)
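
The stages above can be sketched locally before any cloud resources exist. This is a minimal illustration under stated assumptions, not a reference implementation: a temporary directory stands in for the S3 or GCS bucket, SQLite stands in for the warehouse, and SAMPLE_CSV, the weather table, and the dt= key layout are all hypothetical choices.

```python
import csv
import sqlite3
from datetime import date
from pathlib import Path

# Hypothetical sample standing in for a CSV downloaded from a public source.
SAMPLE_CSV = "city,temp_c\nOslo,4.5\nLima,\nCairo,31.0\n"


def extract() -> str:
    """Return raw CSV text. A real extract would call an API or download a file."""
    return SAMPLE_CSV


def store_raw(raw: str, lake_dir: Path, run_date: date) -> Path:
    """Write the untouched payload under a date-partitioned key (the lake layer)."""
    path = lake_dir / f"raw/dt={run_date.isoformat()}/weather.csv"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(raw, encoding="utf-8")
    return path


def transform(raw_path: Path) -> list[tuple[str, float]]:
    """Drop rows with a missing temperature and cast the rest to float."""
    with raw_path.open(newline="", encoding="utf-8") as f:
        return [
            (row["city"], float(row["temp_c"]))
            for row in csv.DictReader(f)
            if row["temp_c"].strip()
        ]


def load(rows, conn: sqlite3.Connection) -> None:
    """Insert cleaned rows into the warehouse table (SQLite in this sketch)."""
    conn.execute("CREATE TABLE IF NOT EXISTS weather (city TEXT, temp_c REAL)")
    conn.executemany("INSERT INTO weather VALUES (?, ?)", rows)
    conn.commit()


def run_pipeline(lake_dir: Path, conn: sqlite3.Connection, run_date: date) -> int:
    """Run extract, raw storage, transform, and load; return the rows loaded."""
    raw_path = store_raw(extract(), lake_dir, run_date)
    rows = transform(raw_path)
    load(rows, conn)
    return len(rows)
```

In a scheduled deployment, run_pipeline would be the body of the Cloud Function or Lambda handler, with the directory and SQLite connection swapped for bucket and warehouse clients.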

What this demonstrates

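This architecture lands raw data before transforming it, which is the core idea behind the ELT pattern (extract, load, transform: load raw data first, then transform it in the warehouse). To follow ELT strictly, load the raw files into the warehouse and run the transform step there in SQL. ELT has become the dominant pattern for cloud data pipelines. Understanding the difference between ETL and ELT, and why ELT took over as cloud storage became cheap and data warehouses became powerful, is a common data engineering discussion topic.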
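
To make the ELT ordering concrete, the sketch below loads records into a staging table exactly as received and only then cleans them with SQL inside the warehouse. SQLite is an assumed stand-in for BigQuery or Redshift, and RAW_ROWS, raw_weather, and weather_clean are hypothetical names.

```python
import sqlite3

# Hypothetical raw records, loaded exactly as received (all text, one bad row).
RAW_ROWS = [("Oslo", "4.5"), ("Lima", ""), ("Cairo", "31.0")]


def load_raw(conn, rows):
    """Load step: copy the source data into a staging table without cleaning it."""
    conn.execute("CREATE TABLE IF NOT EXISTS raw_weather (city TEXT, temp_c TEXT)")
    conn.executemany("INSERT INTO raw_weather VALUES (?, ?)", rows)
    conn.commit()


def transform_in_warehouse(conn):
    """Transform step: rebuild the clean table from staging using SQL only."""
    conn.execute("DROP TABLE IF EXISTS weather_clean")
    conn.execute(
        """
        CREATE TABLE weather_clean AS
        SELECT city, CAST(temp_c AS REAL) AS temp_c
        FROM raw_weather
        WHERE temp_c <> ''
        """
    )
    conn.commit()
```

Because the staging table keeps every raw row, the transform can be changed and re-run later without extracting the data again.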

What to document

Explain why you separated raw storage from transformed data (data lake vs data warehouse). Explain how the orchestration works and what happens if the extract step fails. Explain the schema of the data you chose to work with and any transformations you applied.

Tier 2: Orchestrated pipeline with failure handling (intermediate level)

What to build

Extend the batch pipeline with a proper orchestration tool and real failure handling. Use Apache Airflow (managed via Cloud Composer on GCP, or Amazon MWAA on AWS) or Prefect (simpler to self-host for a portfolio project).

The pipeline should:

  • Define a DAG (directed acyclic graph) with task dependencies, in which extract, validate, transform, and load are separate tasks
  • Implement data quality checks: if the extracted dataset has fewer rows than expected, fail the pipeline and alert rather than loading bad data
  • Handle partial failures: if the transform step fails, a retry resumes from the failed task and the extract does not need to re-run
  • Log pipeline runs with success/failure status and duration
  • Send a notification on pipeline failure (Slack webhook or email via SES/SendGrid)
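
The requirements above can be prototyped without an orchestrator to check the logic first. The sketch below is plain Python rather than Airflow or Prefect code: run_dag, validate_row_count, and DataQualityError are hypothetical names, the DAG is reduced to a linear chain, and the state dictionary is an in-memory stand-in for the run metadata a real orchestrator persists.

```python
import time


class DataQualityError(Exception):
    """Raised when an extracted dataset fails a validation rule."""


def validate_row_count(rows, min_rows):
    """Fail loudly instead of letting a short extract reach the warehouse."""
    if len(rows) < min_rows:
        raise DataQualityError(f"expected at least {min_rows} rows, got {len(rows)}")
    return rows


def run_dag(tasks, state, notify):
    """Run (name, callable) pairs in order, passing each output to the next task.

    `state` maps finished task names to their outputs, so a re-run after a
    failure skips tasks that already succeeded. Returns a log of
    (task name, status, duration in seconds) tuples.
    """
    log = []
    previous = None
    for name, fn in tasks:
        if name in state:
            previous = state[name]          # reuse the stored output
            log.append((name, "skipped", 0.0))
            continue
        started = time.monotonic()
        try:
            previous = fn(previous)
        except Exception as exc:
            log.append((name, "failed", time.monotonic() - started))
            notify(f"task {name} failed: {exc}")  # e.g. post to a Slack webhook
            return log                      # downstream tasks do not run
        state[name] = previous
        log.append((name, "success", time.monotonic() - started))
    return log
```

Calling run_dag a second time with the same state after fixing a failed transform skips extract and validate, which is the behavior an Airflow "clear and retry from the failed task" gives you.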

What this demonstrates

Data quality validation before loading is one of the most important real-world practices in data engineering and one of the most commonly overlooked in portfolio projects. A pipeline that silently loads bad data is worse than no pipeline. Adding a validation step with a clear definition of “good enough” shows data engineering maturity.

What to document

Explain the data quality rules you defined: what constitutes a valid dataset? What threshold did you set for “too few rows” and why? Explain how the DAG handles partial failures — which tasks are idempotent (safe to re-run) and which are not?
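
One common way to make a load task idempotent is to replace the whole partition for the run date inside a single transaction. The sketch below assumes SQLite as a stand-in warehouse and a hypothetical daily_sales table; in BigQuery or Redshift the same idea is usually expressed as a partition overwrite or a MERGE.

```python
import sqlite3


def idempotent_load(conn, run_date, rows):
    """Replace the partition for `run_date` so a re-run cannot duplicate rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_sales (run_date TEXT, sku TEXT, units INTEGER)"
    )
    with conn:  # one transaction: the delete and the insert commit together
        conn.execute("DELETE FROM daily_sales WHERE run_date = ?", (run_date,))
        conn.executemany(
            "INSERT INTO daily_sales VALUES (?, ?, ?)",
            [(run_date, sku, units) for sku, units in rows],
        )
```

A plain INSERT without the DELETE would be the non-idempotent version: every retry would append the same rows again.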

Tier 3: Streaming data pipeline (advanced level)

What to build

Build a real-time streaming pipeline that processes events as they arrive rather than in batches. This is more complex but highly relevant for roles at companies doing real-time analytics, event-driven architectures, or IoT.

Architecture:

  • Producer: a script or service generating events and publishing to GCP Pub/Sub, AWS Kinesis, or Apache Kafka
  • Consumer: a streaming processor using Dataflow (GCP), Amazon Managed Service for Apache Flink (AWS, formerly Kinesis Data Analytics), or a custom consumer application
  • Storage: events written to BigQuery, DynamoDB, or a time-series database like InfluxDB
  • Monitoring: metrics on throughput, lag, and error rate for the consumer
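
The roles in this architecture can be modelled in memory to work out what the monitoring numbers mean before choosing a broker. In the sketch below, InMemoryStream and Consumer are hypothetical classes, the event shape (sensor, value) is assumed, and lag is measured as the count of published events the consumer has not yet read.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryStream:
    """Stand-in for a Pub/Sub topic, Kinesis stream, or Kafka partition."""
    events: list = field(default_factory=list)

    def publish(self, event):
        self.events.append(event)


@dataclass
class Consumer:
    stream: InMemoryStream
    offset: int = 0        # index of the next event to read
    processed: int = 0
    errors: int = 0
    sink: list = field(default_factory=list)  # stand-in for the storage layer

    def poll(self, max_events):
        """Process up to `max_events` events and advance the offset."""
        batch = self.stream.events[self.offset:self.offset + max_events]
        for event in batch:
            try:
                self.sink.append(
                    {"sensor": event["sensor"], "value": float(event["value"])}
                )
                self.processed += 1
            except (KeyError, TypeError, ValueError):
                self.errors += 1   # count the malformed event and move on
        self.offset += len(batch)

    def metrics(self):
        """Return lag, processed count, and error rate for this consumer."""
        seen = self.processed + self.errors
        return {
            "lag": len(self.stream.events) - self.offset,
            "processed": self.processed,
            "error_rate": self.errors / seen if seen else 0.0,
        }
```

Publishing faster than poll is called makes the lag value grow, which is the condition a production dashboard or alert would watch for.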

What this demonstrates

Streaming pipelines require reasoning about delivery guarantees (at-least-once, exactly-once), consumer lag, backpressure, and ordering. Batch pipelines rarely force you to confront these concepts. Showing that you have thought through message delivery semantics (for example, why exactly-once processing is hard and what you gave up by settling for at-least-once) is advanced data engineering thinking.

What to document

Explain the delivery guarantee your pipeline provides (at-least-once is the most common achievable guarantee). Explain what happens if the consumer falls behind the producer (consumer lag). Explain how your consumer handles duplicate messages if your guarantee is at-least-once.
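
Duplicate handling under at-least-once delivery usually comes down to tracking an event identifier. The sketch below assumes each event carries a unique event_id field set by the producer; DedupingConsumer is a hypothetical class, and the in-memory set would need to be replaced by durable storage (often the sink itself) in a real consumer.

```python
class DedupingConsumer:
    """Applies each event at most once even if the broker redelivers it."""

    def __init__(self):
        self.seen_ids = set()   # a real system would persist this
        self.totals = {}        # running sum of values per sensor

    def handle(self, event):
        """Return True if the event was applied, False if it was a duplicate."""
        if event["event_id"] in self.seen_ids:
            return False
        key = event["sensor"]
        self.totals[key] = self.totals.get(key, 0) + event["value"]
        self.seen_ids.add(event["event_id"])
        return True
```

Without the identifier check, a redelivered event would be added to the running total twice, which is the failure mode worth describing in your documentation.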

Infrastructure as code for data pipelines

All cloud infrastructure (Pub/Sub topics, Kinesis streams, BigQuery datasets, S3 buckets, IAM roles) should be provisioned with Terraform. Many production data teams manage infrastructure with Terraform alongside whatever orchestration tool they use for the pipeline logic.

The IAM design is important: the pipeline service account or execution role should have only the permissions it needs. A pipeline that extracts from an API and loads into BigQuery does not need access to your VPC, your Kubernetes cluster, or any other unrelated resource.
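
As an illustration of what "only the permissions it needs" can look like on AWS, the sketch below builds an IAM policy document limited to one bucket prefix and includes a small check for wildcard grants. The function names, bucket, and prefix are hypothetical, and the policy covers only raw S3 storage; a real pipeline role would also need narrowly scoped warehouse permissions.

```python
def pipeline_policy(bucket, raw_prefix):
    """Build an IAM policy document scoped to one prefix of one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadWriteRawObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{raw_prefix}/*",
            },
            {
                "Sid": "ListRawPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": f"{raw_prefix}/*"}},
            },
        ],
    }


def _as_list(value):
    """IAM allows a single string or a list for Action and Resource."""
    return value if isinstance(value, list) else [value]


def has_wildcard_grant(policy):
    """Flag Allow statements that grant every action or every resource."""
    for stmt in policy["Statement"]:
        if stmt["Effect"] != "Allow":
            continue
        if "*" in _as_list(stmt["Action"]) or "*" in _as_list(stmt["Resource"]):
            return True
    return False
```

Serialized with json.dumps, a document like this can be supplied as the policy JSON for a Terraform-managed IAM policy, and a check like has_wildcard_grant can run in CI to catch overly broad grants.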

Choosing your data source

Public data sources worth using for a portfolio project:

  • Open government datasets (data.gov, data.gov.uk, data.europa.eu)
  • Financial data APIs with free tiers (Alpha Vantage), or Yahoo Finance data accessed through unofficial libraries
  • Weather data APIs (Open-Meteo is free with no API key)
  • Sports data APIs (football data APIs, NBA statistics)
  • AWS public datasets (the Registry of Open Data on AWS)
  • GCP public datasets in BigQuery (accessible from the BigQuery console)

Choose a dataset that has enough rows to make the pipeline non-trivial but not so many that storing and querying it incurs meaningful costs. A dataset of a few million rows is ideal.