Step 1: Configure dbt sources
Copy the dbt scaffold from materials/dbt-scaffold/ into your project's working directory. The scaffold contains dbt_project.yml, profiles.yml, and empty staging/, intermediate/, and marts/ model directories.
In P3, the source configuration was provided for you. This time, you create it. Direct Claude to create a schema.yml file in the models/staging/ directory that declares all four sources from the DuckDB tables you loaded in Unit 2: forestry, sawmill, customs, and tag_batch_mapping.
Create a schema.yml file in models/staging/ that declares dbt sources for the four tables we loaded into DuckDB: forestry, sawmill, customs, and tag_batch_mapping. Include column descriptions based on the data dictionary in CLAUDE.md.
The source configuration tells dbt where raw data lives. Every staging model will reference these sources with {{ source('...', '...') }} instead of hardcoding table names.
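The declaration Claude produces should look roughly like this sketch. The source name ("raw") and schema ("main") are assumptions here; match them to how you loaded DuckDB in Unit 2, and pull the column descriptions from CLAUDE.md.

```yaml
# models/staging/schema.yml -- illustrative sketch; source name and schema
# are assumptions, adjust to your DuckDB setup
version: 2

sources:
  - name: raw
    schema: main
    tables:
      - name: forestry
        description: "Harvested log records from the forestry system"
      - name: sawmill
        description: "Processing batch records from the sawmill"
      - name: customs
        description: "Export shipment declarations"
      - name: tag_batch_mapping
        description: "Links forestry log tags to sawmill batch numbers"
```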
Step 2: Decide on naming conventions
Before building any models, decide on the naming conventions your staging models will use. The CLAUDE.md governance file specifies the convention pattern: stg_ prefix for staging, int_ for intermediate, fct_/dim_ for marts. But the specific model names are yours to decide.
Consider: stg_forestry__logs, stg_sawmill__batches, stg_customs__shipments, stg_tag_batch__mapping. The double underscore separates the source system from the entity -- a convention drawn from dbt's community style guide.
Whatever you choose, write it down. Every model, every downstream reference, every test will use these names. A naming convention is a communication system -- it tells you (and anyone reading the pipeline later) what a model does and where it sits in the architecture.
Step 3: Build the forestry staging model
Direct Claude to build the first staging model for the forestry source.
Create a staging model at models/staging/stg_forestry__logs.sql. It should source-conform the forestry table: select all columns from the source, cast types explicitly (dates as DATE, volumes as DECIMAL), rename any columns to match our naming convention. Do not apply any business logic -- staging models only clean and rename.
Review what Claude produces. Staging models do three things: select from the source, cast types, and rename columns. They do not filter rows, calculate derived values, or join to other tables. If Claude added any transformation beyond type casting and renaming, that logic belongs in the intermediate layer, not here.
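A model that passes that review looks roughly like this. It is a sketch: the source name ("raw") and the column names are assumptions; take the real ones from your schema.yml and the CLAUDE.md data dictionary.

```sql
-- models/staging/stg_forestry__logs.sql -- sketch; source and column
-- names are assumptions
with source as (
    select * from {{ source('raw', 'forestry') }}
)

select
    log_tag,
    cast(harvest_date as date)           as harvest_date,
    cast(volume_m3 as decimal(10, 2))    as volume_m3,
    concession_id
from source
-- no WHERE, no joins, no derived columns: that logic belongs downstream
```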
Step 4: Build the sawmill staging model
Same approach, different source structure.
Create a staging model at models/staging/stg_sawmill__batches.sql. Source-conform the sawmill table: cast types, rename columns. No business logic.
The sawmill source has its own identification scheme (batch_number in SB-2024-NNN format), its own columns (waste_percentage, grade), and its own structure. The staging model brings it into your convention without trying to connect it to anything else. That connection happens later.
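The same pattern applies, only the columns change. A sketch, with assumed column names:

```sql
-- models/staging/stg_sawmill__batches.sql -- sketch; column names assumed
with source as (
    select * from {{ source('raw', 'sawmill') }}
)

select
    batch_number,                                  -- SB-2024-NNN, kept as-is
    cast(processing_date as date)          as processing_date,
    cast(waste_percentage as decimal(5, 2)) as waste_percentage,
    grade
from source
```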
Step 5: Build the tag-to-batch mapping staging model
This is the critical staging model. The tag-to-batch mapping is where identity resolution happens -- cleaning the forestry log tags so they can be matched across systems.
Create a staging model at models/staging/stg_tag_batch__mapping.sql. This model should: (1) trim whitespace from forestry_log_tag, (2) pad forestry_log_tag to 4 digits with leading zeros (so "247" becomes "0247"), (3) cast types and rename columns. Route any records where the tag still can't be resolved to a valid format to the error quarantine table.
Review the cleaning logic carefully. The profiling in Unit 2 showed specific inconsistencies: tags missing leading zeros ("247" instead of "0247"), tags with trailing whitespace ("0248 "), and some entries in a format that doesn't match the expected pattern. The cleaning logic needs to handle each of these.
After cleaning, some records will still not resolve to a valid tag format. Those go to the error quarantine table you set up in Unit 2. The staging model should not drop them silently.
AI commonly generates cleaning code that handles the obvious cases (whitespace trimming) but misses the edge cases (tags in unexpected formats). Check that the LPAD or equivalent logic actually handles all the cases you found during profiling.
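The core cleaning logic can be sketched as follows. This is illustrative, not the definitive implementation: the valid-tag pattern (exactly four digits) comes from the Unit 2 profiling, and how the non-matching remainder reaches the quarantine table depends on how you set that table up (for example, a separate model selecting the inverse of the filter).

```sql
-- models/staging/stg_tag_batch__mapping.sql -- illustrative cleaning logic
with source as (
    select * from {{ source('raw', 'tag_batch_mapping') }}
),

cleaned as (
    select
        -- trim whitespace first, then pad to 4 digits ("247" -> "0247")
        lpad(trim(forestry_log_tag), 4, '0') as forestry_log_tag,
        batch_number
    from source
)

-- keep only tags that now match the expected 4-digit format;
-- the rest must be routed to the quarantine table, not dropped
select *
from cleaned
where regexp_matches(forestry_log_tag, '^[0-9]{4}$')
```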
Step 6: Build the customs staging model
Create a staging model at models/staging/stg_customs__shipments.sql. Source-conform the customs table. The batch_numbers column contains comma-separated values -- leave it as-is for now. The intermediate layer will handle the UNNEST.
The customs source references sawmill batch numbers directly with matching identifiers. No cleaning needed for the join. The batch_numbers column is semi-structured (comma-separated in a single field), but that complexity belongs in the intermediate layer when you assemble the chain of custody.
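To see why the complexity is deferred, here is a preview of what the intermediate layer will eventually do with that column -- not part of the staging model. DuckDB syntax; column names are assumptions:

```sql
-- Intermediate-layer preview (NOT staging): one row per batch per shipment
select
    shipment_id,
    trim(unnest(string_split(batch_numbers, ','))) as batch_number
from stg_customs__shipments
```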
Step 7: Add schema tests
Direct Claude to add dbt schema tests for the staging models.
Add dbt schema tests to the staging layer schema.yml: unique and not_null tests on primary keys, accepted_values where appropriate, and relationships tests between the mapping model and the forestry staging model. Add tests based on what we know about the data from profiling.
Review what Claude generates. Check for tautological tests -- tests that verify something the database already guarantees or that can never fail given the data's structure. A unique test on a column that is already the primary key of a source table tells you nothing new. A not_null test on a column you just profiled and found zero nulls in is structural insurance, not insight.
The valuable tests are the ones that would catch real problems: relationships between the mapping model's cleaned tags and the forestry model's tags, accepted values for concession_id (C1 through C4), and not_null on columns where a null would indicate a broken extraction.
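Those valuable tests translate to schema.yml entries along these lines. A sketch: model and column names are assumptions based on the conventions above.

```yaml
# Additions to models/staging/schema.yml -- sketch; names assumed
models:
  - name: stg_forestry__logs
    columns:
      - name: log_tag
        tests: [unique, not_null]
      - name: concession_id
        tests:
          - accepted_values:
              values: ['C1', 'C2', 'C3', 'C4']
  - name: stg_tag_batch__mapping
    columns:
      - name: forestry_log_tag
        tests:
          - not_null
          - relationships:
              to: ref('stg_forestry__logs')
              field: log_tag
```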
Step 8: Run and verify
Run the staging layer.
dbt build --select staging
All tests should pass. If any fail, read the error message and fix the model before moving on.
After the build succeeds, verify row counts. Query each staging model:
SELECT 'stg_forestry' AS model, COUNT(*) AS row_count FROM stg_forestry__logs
UNION ALL SELECT 'stg_sawmill', COUNT(*) FROM stg_sawmill__batches
UNION ALL SELECT 'stg_customs', COUNT(*) FROM stg_customs__shipments
UNION ALL SELECT 'stg_mapping', COUNT(*) FROM stg_tag_batch__mapping;
Check the counts against the verification checklist. Then check the critical equation: the number of rows in the mapping staging model plus the number of rows in the error quarantine table should equal the source total of 480 rows. If the numbers don't add up, records were lost somewhere -- either the staging model is dropping rows without quarantining them, or the quarantine logic is incomplete.
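A single query can verify the equation. The quarantine table name here is an assumption from your Unit 2 setup; substitute your own.

```sql
-- Reconciliation: staging rows + quarantined rows should equal the source total
SELECT
    (SELECT COUNT(*) FROM stg_tag_batch__mapping)
  + (SELECT COUNT(*) FROM error_quarantine) AS accounted_for,
    480                                     AS expected;
```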
Check: dbt test passes. Row counts match. Staging + quarantine = source total for mapping.