Data Engineering: Track Setup
Complete the platform setup first if you haven't already. You should have a terminal, Claude Code, Git, and a GitHub account ready.
1. Create your track folder
mkdir -p ~/dev/data-engineering
cd ~/dev/data-engineering
2. Data engineering tools: let Claude Code do it
Open Claude Code in your track folder:
claude
Paste this prompt:
I'm setting up a data engineering environment. Please:
1. Install Python 3.11+ via Miniconda, then create a conda environment called "de"
2. Install core packages in the de environment: pandas, duckdb, dbt-core, dbt-duckdb,
dagster, dagster-webserver, sqlalchemy, psycopg2-binary
3. Install Docker if not already installed (or tell me how, it needs admin access)
4. Verify PostgreSQL is accessible via Docker by pulling the postgres image
After each step, verify it worked and show me the result.
Note on Docker: it's essential for data engineering. You'll use it for databases, orchestration, and deployment from the very first projects. If Claude Code can't install it directly, it will tell you what command to run yourself.
Verify
Once Claude Code finishes:
conda activate de
python --version
python -c "import duckdb; import dbt; import dagster; print('All packages installed')"
dbt --version
docker --version
You should see Python 3.11+, "All packages installed", a dbt version, and a Docker version.
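If you'd rather check everything in one go, a short Python sketch (run inside the activated de environment) can report every version at once. This is just a convenience, not part of the official setup; the distribution names below match the step-2 install list:

```python
import importlib.metadata as md

# Distribution names as used by pip/conda for the step-2 install list.
PACKAGES = ["pandas", "duckdb", "dbt-core", "dagster", "sqlalchemy"]

def check(pkgs):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for pkg in pkgs:
        try:
            versions[pkg] = md.version(pkg)
        except md.PackageNotFoundError:
            versions[pkg] = None
    return versions

for pkg, ver in check(PACKAGES).items():
    print(f"{pkg}: {ver or 'MISSING'}")
```

Anything printed as MISSING means the conda install for that package didn't complete.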
3. Your first look
Everything is installed. Before you start Project 1, see what Claude Code can do when you point it at a data engineering problem.
Stay in your track folder with Claude Code open, and paste this:
Create two small CSV files: orders.csv (500 rows: order_id, customer_id, order_date,
product_id, quantity, unit_price) and customers.csv (50 rows: customer_id, name,
region, signup_date). Then build a simple dbt project that: loads both CSVs as seeds,
creates a staging model for each, and creates a mart model that joins them into an
order_summary with total_revenue per customer. Run dbt build and show me the results.
In a few minutes, Claude will generate source data, build a dbt project with proper staging and mart layers, and run it end-to-end. A working data pipeline from a single prompt.
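Under the hood, that mart model is just a join plus an aggregation. A minimal pure-Python sketch of the same computation, using hypothetical toy rows instead of the generated CSVs and no dbt at all, shows what order_summary actually does:

```python
# Toy versions of the two seeds; column names follow the prompt above.
orders = [
    {"order_id": 1, "customer_id": "c1", "quantity": 2, "unit_price": 10.0},
    {"order_id": 2, "customer_id": "c1", "quantity": 1, "unit_price": 5.0},
    {"order_id": 3, "customer_id": "c2", "quantity": 3, "unit_price": 4.0},
]
customers = [
    {"customer_id": "c1", "name": "Ada", "region": "EU"},
    {"customer_id": "c2", "name": "Grace", "region": "US"},
]

def order_summary(orders, customers):
    """Join orders to customers and sum revenue per customer,
    mirroring what the dbt mart model computes in SQL."""
    names = {c["customer_id"]: c["name"] for c in customers}
    totals = {}
    for o in orders:
        cid = o["customer_id"]
        totals[cid] = totals.get(cid, 0.0) + o["quantity"] * o["unit_price"]
    return {names[cid]: revenue for cid, revenue in totals.items()}

print(order_summary(orders, customers))  # {'Ada': 25.0, 'Grace': 12.0}
```

dbt's value isn't the computation itself; it's the layering, testing, and lineage around it.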
As you work through the track, you'll learn why a single prompt isn't enough: why that schema design might not handle slowly changing dimensions, why those joins might produce wrong row counts, why that pipeline needs quality tests, and why a consumer would need freshness guarantees and documentation.
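One of those pitfalls is easy to demonstrate even without a database: if the customer side of a join accidentally contains duplicate keys, every matching order row fans out, inflating row counts and revenue. A hypothetical pure-Python illustration (a naive stand-in for SQL's INNER JOIN, not anything dbt generates):

```python
def inner_join(left, right, key):
    """Naive nested-loop inner join, like SQL's INNER JOIN."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]

orders = [{"order_id": 1, "customer_id": "c1", "revenue": 10.0}]

# Clean dimension: one row per customer, so the join preserves row count.
clean = [{"customer_id": "c1", "region": "EU"}]
print(len(inner_join(orders, clean, "customer_id")))  # 1

# Duplicated key (e.g. a bad snapshot load): each order row doubles,
# and any downstream revenue sum silently doubles with it.
dirty = [{"customer_id": "c1", "region": "EU"},
         {"customer_id": "c1", "region": "EU"}]
print(len(inner_join(orders, dirty, "customer_id")))  # 2
```

This is exactly the kind of bug that dbt's uniqueness tests on key columns exist to catch.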
But for now, look at what just happened. That's the starting point.
Ready
Start Project 1.