Step 1: Profile the full dataset
You have a plan, a pipeline spec, and a CLAUDE.md with the data dictionary. Time to look at the actual data.
The sample file had 30 rows. The full dataset is materials/batch-data-full.csv with 891 rows representing one day of production across all three lines. Direct Claude to profile it: you want row counts, column distributions, null patterns, and value ranges -- especially broken down by line number. Give Claude the context from your CLAUDE.md so it knows the column names and types.
Profile the full batch dataset at materials/batch-data-full.csv. I need:
- Total row count and row count by line_number
- Value ranges for temperature, humidity, chemical_concentration, color_match_score -- grouped by line_number
- Null counts by column
- Distinct values for fabric_type, dye_formula, operator_id
- Distribution of pass_fail by line_number
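The same kind of profile can be sketched outside the pipeline. Here is a minimal Python illustration using a few made-up rows in place of materials/batch-data-full.csv -- the column names come from the data dictionary, but the values are synthetic:

```python
import csv
import io
from collections import Counter, defaultdict

# A few synthetic rows standing in for materials/batch-data-full.csv
# (column names from the data dictionary; values are made up).
SAMPLE = """batch_id,line_number,temperature,humidity,color_match_score,fabric_type,pass_fail
B001,1,185.0,45,1.4,polyester,pass
B002,2,82.5,,97.2,cotton_blend,pass
B003,3,74.0,51,3.1,polyester,fail
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Row counts by line_number.
by_line = Counter(r["line_number"] for r in rows)

# Null counts by column (an empty field in a CSV export reads as "").
nulls = {col: sum(1 for r in rows if r[col] == "") for col in rows[0]}

# Temperature ranges grouped by line_number.
temps = defaultdict(list)
for r in rows:
    temps[r["line_number"]].append(float(r["temperature"]))
```

Grouping every range by line_number is the important habit here -- it is what surfaces the Fahrenheit/Celsius split in the real data.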
Step 2: Examine the profiling results
The profiling should confirm what you already know and reveal something you do not.
What you already know: Line 1 temperatures are in the 160-200 range (Fahrenheit), Lines 2 and 3 are in the 68-95 range (Celsius). The profiling confirms the conversion is needed.
What you did not know: look at color_match_score by fabric type, not by line. Polyester scores cluster in the 0-6 range. Cotton blend scores cluster in the 60-100 range. These are radically different scales sharing the same column name.
Run a query to see this clearly:
SELECT fabric_type,
       MIN(color_match_score) AS min_score,
       MAX(color_match_score) AS max_score,
       ROUND(AVG(color_match_score), 1) AS avg_score
FROM read_csv_auto('materials/batch-data-full.csv')
GROUP BY fabric_type;
Polyester shows single-digit values. Cotton blend shows double-digit values. If you average across fabric types, the result is a number between the two scales that means nothing -- the same kind of silent failure as averaging Fahrenheit and Celsius temperatures.
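The failure mode is easy to reproduce. A minimal Python sketch with made-up scores (the values are illustrative, not from the dataset):

```python
# Toy scores on two different scales sharing one column, as in the batch data.
polyester = [1.2, 0.8, 3.5]        # Delta-E: lower is better
cotton_blend = [97.0, 92.5, 98.1]  # spectrophotometer: higher is better

# Averaging across fabric types mixes the scales into a meaningless number
# sitting between them -- it describes neither measurement.
all_scores = polyester + cotton_blend
blended = sum(all_scores) / len(all_scores)

# Grouping by fabric type keeps each average on its own scale.
avg_polyester = sum(polyester) / len(polyester)
avg_cotton = sum(cotton_blend) / len(cotton_blend)
```

Here `blended` lands near 49 -- far outside the 0-6 Delta-E scale and far below any passing spectrophotometer reading -- while the per-fabric averages stay interpretable.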
Step 3: Ask Roberto about the color match score
This is the second hidden constraint. Message Roberto and ask how the color match score is calculated. Ask whether it means the same thing for all fabric types.
Roberto will explain: polyester uses Delta-E measurement, where lower is better. A score under 2.0 is excellent and means the color is nearly indistinguishable from the target. A score above 2.0 fails. Cotton blends use a spectrophotometer reading on a 0-100 scale, where higher is better. A score above 95 passes. Both are called "color match score" in the production system because the QC team knows which scale applies to which fabric type.
He will be surprised you noticed. The team works with both scales daily and does not think of it as confusing. An operator who transferred from cotton blends to polyester found it confusing at first -- a story Roberto tells with a laugh.
This is a constraint that only surfaces through the right questions. The data contains valid numbers in every row. Nothing is null, nothing is out of range for its respective scale. The problem is that the same column name carries two different meanings, and the pass_fail column is calculated using different thresholds depending on which scale applies.
Step 4: Update the CLAUDE.md
Add the color match score distinction to your CLAUDE.md. Under the data dictionary entry for color_match_score, document:
- Polyester: Delta-E scale, 0-6 range, lower is better, pass threshold < 2.0
- Cotton blend: spectrophotometer scale, 0-100 range, higher is better, pass threshold > 95
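In the CLAUDE.md itself, the entry might look something like this (a sketch -- match whatever format your data dictionary already uses):

```markdown
### color_match_score
Two measurement scales share this column. Never average or threshold it
without grouping by fabric_type.
- Polyester: Delta-E, 0-6 range, lower is better, pass threshold < 2.0
- Cotton blend: spectrophotometer, 0-100 range, higher is better, pass threshold > 95
```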
Under known data quality concerns, add this as the second entry alongside the temperature unit issue. Every model Claude builds from this point forward will have this context. Without it, Claude would likely average color match scores across fabric types or build a single threshold for pass/fail -- both producing meaningless results.
Step 5: Build the staging models
Direct Claude to build the staging model. Be explicit about the constraints.
Build the staging model for the batch data. Constraints:
- stg_ prefix (stg_batch_data)
- Source-conform only -- no business logic
- Convert Line 1 temperatures from Fahrenheit to Celsius: (temp - 32) * 5/9
- Handle NULL humidity values -- keep the rows, don't drop them
- Do NOT normalize color match scores in staging -- that's a business logic decision for the intermediate layer
- Pass through all other columns unchanged
The constraint about color match scores matters. Without it, Claude will see the two-scale problem in the data dictionary and fix it in staging. That is the wrong layer. Temperature conversion is source-conforming -- making units consistent so the same column means the same thing on every row. Color match score normalization is business logic -- choosing how to handle two fundamentally different measurement scales. That design decision belongs in the intermediate layer, where you decide what "normalized quality score" means.
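The source-conforming conversion itself reduces to one conditional. A Python sketch of the rule (the staging model itself would be SQL in dbt; this just shows the logic):

```python
def stage_temperature(line_number: int, temperature: float) -> float:
    """Source-conform: Line 1 reports Fahrenheit; convert to Celsius.
    Lines 2 and 3 already report Celsius and pass through unchanged."""
    if line_number == 1:
        return (temperature - 32) * 5 / 9
    return temperature
```

For example, `stage_temperature(1, 185.0)` returns exactly 85.0, while a Line 2 reading of 82.5 passes through untouched.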
If any rows fail validation -- malformed batch IDs, temperatures outside any plausible range, impossible timestamps -- they should go to an error table rather than being silently dropped. A staging model that drops bad rows without telling you creates invisible data loss.
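The routing pattern can be sketched in a few lines of Python. The specific checks and the plausible-range bounds here are illustrative assumptions, not the pipeline's spec -- the point is that failed rows land somewhere visible:

```python
# Minimal sketch of validation routing: rows that fail a check go to an
# error list with a reason attached, instead of being silently dropped.
def route_rows(rows):
    staged, errors = [], []
    for row in rows:
        temp = row.get("temperature")
        # Illustrative bound: no reading on either unit scale should
        # fall outside this window.
        if temp is None or not (-50 <= temp <= 300):
            errors.append({**row, "error": "implausible temperature"})
        else:
            staged.append(row)
    return staged, errors
```

Every input row ends up in exactly one of the two outputs, so staged count plus error count always equals the source count -- an invariant worth testing in the real pipeline.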
Step 6: Verify the staging output
Run dbt build and check the output. Query the staging model directly:
SELECT line_number,
       MIN(temperature) AS min_temp,
       MAX(temperature) AS max_temp,
       ROUND(AVG(temperature), 1) AS avg_temp
FROM stg_batch_data
GROUP BY line_number
ORDER BY line_number;
All three lines should now show temperatures on the Celsius scale, roughly 68-95. Line 1 values that were 160-200 should now be roughly 71-93 after conversion.
Check the color match scores:
SELECT fabric_type, MIN(color_match_score), MAX(color_match_score)
FROM stg_batch_data
GROUP BY fabric_type;
These should be unchanged from the source data. Polyester still shows 0-6. Cotton blend still shows 58-100. The staging model converted temperatures but left the color match scores alone -- exactly the layer discipline you specified.
✓ Check: Query the staging model for Line 1. Are temperatures now in the roughly 71-93 C range? Query any row -- is the color_match_score unchanged from the source?