Introduction to Data Engineering
What data engineers actually do
Data engineering is the infrastructure layer. If data scientists answer questions and analysts build dashboards, data engineers make sure the data gets there. Correctly, on time, every time. Here are the main roles:
Data Engineer. Builds and maintains the pipelines that move data from sources to destinations. Ingestion, transformation, quality checks, orchestration. The broadest role and the one this track most closely follows.
Analytics Engineer. Sits between data engineering and analytics. Builds the transformation layer (typically in dbt) that turns raw data into clean, tested, documented models that analysts can query. A newer role that's become standard.
Platform / Infrastructure Engineer (data). Focuses on the systems that pipelines run on. Cloud infrastructure, container orchestration, cost management, permissions, networking. Makes sure the platform is reliable and efficient.
Data Reliability Engineer. Focuses on data quality, monitoring, and SLAs. Detects when data is late, wrong, or missing before downstream consumers notice. A specialized role at larger companies.
These roles overlap. At a small company, one data engineer does all of it. The underlying workflow is the same.
The professional loop
Every data engineering project, whether it's a simple batch pipeline or a complex streaming architecture, moves through the same cycle:
1. Consumer need and source profiling. Who needs this data? In what shape? How fresh? Where does it come from? What condition is it in? You can't build a pipeline without understanding both ends.
2. Contract and schema design. What's the contract between the data producer and consumer? What schema organizes the data? What do the field names mean? Decisions made here (grain, naming, slowly-changing dimensions) echo through everything downstream.
3. Ingest and land raw. Get the data from source to landing zone without changing it. Batch loads, CDC (Change Data Capture), API pulls, file drops. The raw layer is your safety net. You can always reprocess from it.
4. Transform, test, and enrich. Turn raw data into something useful. Staging, intermediate, and mart layers. Data quality tests at every stage. This is where dbt lives, and where most data engineering code is written.
5. Publish and serve. Make the transformed data available. Warehouse tables, API endpoints, reverse ETL to business tools. The data reaches the people who need it.
6. Observe. Watch the system. Lineage (what depends on what), freshness (is the data current), cost (what are we spending), SLAs (are we meeting our commitments). Without observability, failures are invisible until someone complains.
7. Govern. Access control, PII handling, retention policies, compliance. Who can see what data? How long do we keep it? What regulations apply? Governance is increasingly a hiring requirement.
8. Evolve safely. Schema migrations, backfills, deprecations. The system changes over time: adding columns, replacing sources, retiring tables. Doing this without breaking downstream consumers is the hardest part of data engineering.
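Steps 3 through 5 of the loop can be sketched in a few lines. This is a minimal illustration, not production code: it uses a made-up orders CSV and Python's built-in sqlite3 as a stand-in for a real warehouse (the track itself uses DuckDB and dbt for these stages).

```python
import csv, sqlite3, io

# Hypothetical raw export from a source system.
# Note the duplicate row for order 2 and the missing amount on order 3.
RAW_CSV = """order_id,amount,status
1,19.99,shipped
2,5.00,pending
2,5.00,pending
3,,shipped
"""

db = sqlite3.connect(":memory:")

# 3. Ingest and land raw: load the file exactly as-is, no cleaning yet.
#    Everything lands as TEXT so the raw layer stays a faithful copy.
db.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT, status TEXT)")
rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
db.executemany("INSERT INTO raw_orders VALUES (:order_id, :amount, :status)", rows)

# 4. Transform and test: cast types, deduplicate, drop unusable rows.
db.execute("""
    CREATE TABLE stg_orders AS
    SELECT DISTINCT CAST(order_id AS INTEGER) AS order_id,
                    CAST(amount AS REAL)      AS amount,
                    status
    FROM raw_orders
    WHERE amount <> ''
""")

# A data quality test: the primary key must be unique in the staging table.
dupes = db.execute(
    "SELECT order_id FROM stg_orders GROUP BY order_id HAVING COUNT(*) > 1"
).fetchall()
assert not dupes, f"duplicate order_ids: {dupes}"

# 5. Publish and serve: a small mart table consumers can query directly.
db.execute("""
    CREATE TABLE mart_revenue AS
    SELECT status, SUM(amount) AS revenue FROM stg_orders GROUP BY status
""")
published = dict(db.execute("SELECT status, revenue FROM mart_revenue"))
print(published)
```

The key design point is the separation: `raw_orders` is untouched source data you can always reprocess from, `stg_orders` is cleaned and tested, and `mart_revenue` is what consumers actually see.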
You'll run this loop in every project. What changes is the complexity: early projects give you a clean CSV and a specified schema. Later projects give you multiple conflicting sources, streaming data, and consumers who need different things from the same pipeline.
What you'll work on
Each project is built for a client with a specific problem. You'll direct AI to build pipelines, transformation layers, and quality systems, then verify whether the data arrives correctly and the system handles failure gracefully. Here's a sample:
- A batch pipeline that ingests, transforms, and serves data from a single source
- A dbt project that builds a tested, documented transformation layer
- A pipeline with data quality checks that catch problems before they reach consumers
- An orchestrated system with multiple sources and dependencies
- A streaming pipeline that processes events in near-real-time
- A pipeline migration that replaces a source without breaking downstream consumers
The projects get harder in specific ways. The sources multiply. The schemas conflict. The freshness requirements tighten. The governance rules get real. You move from single-source batch to multi-source streaming with compliance constraints. And throughout, AI is your primary tool: good at writing SQL and pipeline code, but prone to specific mistakes with data quality, schema design, and pipeline logic that you'll learn to catch.
Core tools
These are the tools data engineers use daily. You'll set up the core ones in the track setup; the rest are introduced as projects need them.
Terminal. Your command line. Everything runs through it.
Claude Code. Your AI coding agent. You'll direct it to write SQL, Python, dbt models, pipeline configs, and infrastructure code. It's strong at generating transformation logic, but it makes specific, predictable mistakes with schema design and data quality that you'll learn to catch.
Git and GitHub. Version control. Every project lives in a repository.
Python. Used for ingestion scripts, custom transformations, and orchestration. The standard language for data engineering beyond SQL.
SQL. The language of data transformation and querying. You'll write more SQL than Python in this track.
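As a taste of the SQL you'll write, here is the classic "latest record per key" pattern with a window function, one of the most common deduplication moves in data engineering. The data is made up, and sqlite3 stands in for a warehouse so the example runs anywhere:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customer_updates (customer_id INT, email TEXT, updated_at TEXT)")
db.executemany(
    "INSERT INTO customer_updates VALUES (?, ?, ?)",
    [
        (1, "a@old.com", "2026-01-01"),
        (1, "a@new.com", "2026-02-01"),  # later update should win
        (2, "b@x.com",   "2026-01-15"),
    ],
)

# ROW_NUMBER() ranks each customer's rows newest-first; keeping rn = 1
# collapses a change feed into one current row per customer.
latest = db.execute("""
    SELECT customer_id, email
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_id
                   ORDER BY updated_at DESC
               ) AS rn
        FROM customer_updates
    )
    WHERE rn = 1
    ORDER BY customer_id
""").fetchall()
print(latest)
```

The same query shape works in DuckDB, BigQuery, and PostgreSQL; only minor dialect details change.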
DuckDB. A fast, local analytical database. You'll use it for development and testing before deploying to a cloud warehouse.
dbt Core. The transformation framework. You write SQL models, dbt handles dependencies, testing, documentation, and materialization. The dominant tool for the transform layer in 2026.
Dagster. Pipeline orchestration. Defines what runs when, in what order, with what dependencies. The track uses Dagster's asset-oriented approach: you define what data assets exist and how they're materialized.
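The asset-oriented idea is worth seeing concretely. The toy below is not Dagster's API, just a sketch of the concept: each asset declares what it depends on, and the framework works out the build order, so you never hand-write "run A, then B, then C":

```python
# Registry of declared assets: name -> (dependency names, build function).
assets = {}

def asset(deps=()):
    def register(fn):
        assets[fn.__name__] = (deps, fn)
        return fn
    return register

@asset()
def raw_orders():
    return [{"id": 1, "amount": 20.0}, {"id": 2, "amount": 5.0}]

@asset(deps=("raw_orders",))
def stg_orders(raw_orders):
    return [o for o in raw_orders if o["amount"] > 0]

@asset(deps=("stg_orders",))
def daily_revenue(stg_orders):
    return sum(o["amount"] for o in stg_orders)

def materialize(name, cache={}):
    # Depth-first: build each upstream asset exactly once, then the
    # asset that needs it. The mutable default acts as a memo cache.
    if name not in cache:
        deps, fn = assets[name]
        cache[name] = fn(*(materialize(d) for d in deps))
    return cache[name]

revenue = materialize("daily_revenue")  # builds raw -> staging -> mart
print(revenue)
```

Asking for `daily_revenue` pulls `raw_orders` and `stg_orders` into existence first. Real Dagster adds scheduling, partitions, retries, and a UI on top of this same dependency-graph core.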
PostgreSQL. A relational database. Used as a source system and as infrastructure for other tools.
Docker. Packages your pipelines into containers. Essential for reproducible, deployable data infrastructure.
You'll install additional tools as the track progresses: BigQuery for cloud warehousing, Kafka for streaming, Soda Core for data quality, and others. Each project tells you what's needed.