Learn by Directing AI
Unit 2

Plan the pipeline and set up the project

Step 1: Set up the project

Open a terminal, navigate to ~/dev, and start Claude Code. Paste this setup prompt:

Create the folder ~/dev/data-engineering/p5-textile-quality. Download the project materials from https://learnbydirectingai.dev/materials/dataeng/p5/materials.zip and extract them into that folder. Read CLAUDE.md -- it's the project governance file.

Claude downloads the materials, extracts them, and reads the CLAUDE.md that ships with the project. That file contains the data dictionary, naming conventions, and known quality concerns that will shape every model you build. After setup completes, take a look at the project directory -- the materials include a pipeline spec template, a CLAUDE.md template, batch data files, a Soda Core guide, and a GitHub Actions template.

Step 2: Fill in the pipeline spec

Open materials/pipeline-spec-template.md. The requirements section is pre-filled from Roberto's brief. The design sections -- schema, layer architecture, transformation logic, quality testing strategy -- are empty. That is your job.

Fill in the template using what you learned from Roberto in Unit 1:

  • Data sources: One CSV from Roberto's production system. 891 rows in the full dataset (materials/batch-data-full.csv), representing one day of production across all three lines.
  • Known issues: Line 1 temperatures in Fahrenheit, Lines 2-3 in Celsius. The temperature conversion must happen in staging before any analysis.
  • Requirements: Re-dye rate below 5%. Quality comparison across three lines. Variable correlation analysis. Filtering by fabric type.
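The Line 1 conversion is exactly the kind of logic that will end up in a staging model. As a quick sanity check, the rule can be sketched in Python -- the line identifiers (`"line_1"`, etc.) are assumptions for illustration, not the actual values; check the data dictionary in CLAUDE.md for the real ones:

```python
def to_celsius(temperature: float, line: str) -> float:
    """Normalize a batch temperature reading to Celsius.

    Line 1 records Fahrenheit; Lines 2 and 3 already record Celsius.
    The line identifier values used here are assumed, not confirmed.
    """
    if line == "line_1":
        return (temperature - 32) * 5 / 9
    return temperature

# A 350 degF reading on Line 1 and a 176.7 degC reading on Line 2
# should land in the same range after conversion.
print(round(to_celsius(350.0, "line_1"), 1))  # 176.7
print(to_celsius(176.7, "line_2"))            # 176.7
```

Whatever form the conversion takes in your staging SQL, a spot check like this against a few known readings catches unit mistakes before they propagate into every downstream model.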

You do not need to fill in the schema design or transformation logic sections yet. Those come after you profile the full dataset and discover all the constraints. For now, document what you know and leave the design sections with notes about what information is still needed.

Step 3: Understand plan mode

Before you start building, decompose the work. Claude Code has a plan mode -- you ask it to plan the pipeline work, and it proposes a sequence of steps that you can review, modify, or approve before any code gets written.

Plan mode makes decomposition explicit. Instead of AI diving into code immediately and making structural decisions along the way, you see the proposed breakdown first. You can check whether the sequence makes sense: does schema design come before model creation? Do staging models come before intermediate models? Are the boundaries between pieces reasonable?

This is different from just asking AI to build the pipeline. When you decompose before starting, you can catch structural problems before they become embedded in code. A plan that puts dbt model creation before schema design will produce models that do not match the schema. A plan that combines staging and intermediate work in one step forces AI to track two concerns simultaneously.

Direct Claude to plan the pipeline work. Include the context it needs -- the data dictionary from CLAUDE.md, the temperature conversion requirement, the three-line structure. Ask it to propose a sequence for the full pipeline build.

Plan the pipeline work for Roberto's textile quality analysis. Here's what I need built:

1. Profile the full dataset (materials/batch-data-full.csv) -- row counts, distributions, null patterns, ranges by line
2. Design the schema with unit conversions for the temperature discrepancy
3. Build staging models -- source-conform only, temperature conversion, NULL handling
4. Build intermediate models -- window functions for operator trends, quality calculations
5. Build mart models -- daily quality by line, operator, fabric type
6. Add dbt tests -- structural and business logic
7. Add Soda Core trend checks -- batch count anomaly, quality score ranges
8. Configure Dagster freshness policies

Don't start building yet. Show me the plan first so I can review the sequence and dependencies.
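Step 1 of that plan -- profiling -- is worth understanding concretely before you review what Claude proposes for it. A minimal profiling pass over the batch data might look like the sketch below; the column names (`batch_id`, `production_line`, `temperature`) are assumptions until you see the real header, and the tiny inline sample stands in for the 891-row materials/batch-data-full.csv:

```python
import csv
from collections import Counter
from io import StringIO

def profile(rows, line_column="production_line"):
    """Row count, null count per column, and row count per line."""
    nulls = Counter()
    lines = Counter()
    total = 0
    for row in rows:
        total += 1
        lines[row.get(line_column, "")] += 1
        for col, value in row.items():
            if value is None or value == "":
                nulls[col] += 1
    return {"rows": total, "nulls": dict(nulls), "per_line": dict(lines)}

# Inline sample standing in for materials/batch-data-full.csv;
# the real file has 891 rows and its own column names.
sample = StringIO(
    "batch_id,production_line,temperature\n"
    "B001,line_1,350\n"
    "B002,line_2,176.7\n"
    "B003,line_2,\n"
)
stats = profile(csv.DictReader(sample))
print(stats["rows"], stats["nulls"], stats["per_line"])
```

Null patterns and per-line row counts are the numbers that feed directly into your schema design and dbt test thresholds, which is why profiling has to come first in the plan.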

Step 4: Review the plan

Read what Claude proposes. Check the ordering:

  • Schema design should come before any dbt model creation. The schema defines what each model produces. Writing models without a schema means every model invents its own column names and types.
  • Staging models should come before intermediate models. Intermediate models read from staging. If staging is not built yet, intermediate models have nothing to read.
  • dbt tests should come after the models they test. You cannot test a model that does not exist.
  • Soda Core configuration comes after dbt tests. Soda Core checks batch-level patterns. dbt tests check row-level properties. Get the row-level checks right first.

If the plan has steps in the wrong order, adjust it. If it combines steps that should be separate -- like building staging and intermediate models in one step -- split them. The boundaries between pieces matter. Each step should focus on one concern.

Cross-check the plan against your pipeline spec. Does it cover everything Roberto needs? The re-dye rate analysis, the cross-line comparison, the variable correlation work? If something is missing, add it before you approve.

Step 5: Create the project CLAUDE.md

Open materials/claudemd-template.md. This is a template for the project memory file -- the CLAUDE.md that Claude reads at the start of every session. The template has sections with placeholder comments. Your job is to fill it with the specifics of Roberto's project.

A well-built CLAUDE.md changes the quality of everything Claude produces downstream. When Claude knows that temperature means Fahrenheit on Line 1 and Celsius on Lines 2-3, it writes staging models that handle the conversion. When it knows that stg_ is the staging prefix and int_ is the intermediate prefix, it names models consistently. When it knows that color_match_score means different things for different fabric types, it does not blindly average across fabric types.

Without this file, you would need to repeat this context in every prompt. With it, the context is always present. The infrastructure determines the output quality.

Fill in these sections:

  • Project and client: Roberto, Textiles del Pacifico, the re-dye rate problem, the contract deadline
  • Data dictionary: Every column from the batch data, with types, descriptions, and known issues. Include the temperature unit discrepancy and any other constraints you have discovered.
  • Naming conventions: stg_ for staging, int_ for intermediate, fct_ for facts, dim_ for dimensions
  • Known data quality concerns: Temperature units, any null patterns, anything else from the profiling
  • Work breakdown: The sequence from your plan mode decomposition
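To make those sections concrete, here is a hedged excerpt of what a filled-in CLAUDE.md might contain. Every column name and value below is an assumption for illustration -- your profiling results and the data dictionary that ships with the project are the source of truth:

```markdown
## Data dictionary (excerpt)
- `batch_id` -- unique batch identifier (assumed column name)
- `production_line` -- which of the three lines produced the batch
- `temperature` -- Fahrenheit on Line 1, Celsius on Lines 2-3;
  convert to Celsius in staging before any analysis

## Naming conventions
- `stg_` staging, `int_` intermediate, `fct_` facts, `dim_` dimensions

## Known data quality concerns
- Temperature unit discrepancy between Line 1 and Lines 2-3
```

Keep entries short and declarative like this -- the file is read at the start of every session, so every line should earn its place.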

Save the filled CLAUDE.md in the project root -- not in materials/. This is the file Claude reads at session start. Every model, test, and configuration you direct Claude to build will benefit from the context in this file.

✓ Check

Does your CLAUDE.md include the temperature unit difference between Line 1 and Lines 2-3 as a known data quality concern?