Learn by Directing AI
Unit 4

Build the intermediate and mart layers

Step 1: Understand the intermediate layer

In P3, you went from staging models directly to a mart. That worked because the business logic was simple enough to fit in a mart query. This time, the business logic is not simple.

The chain of custody assembly joins records from three systems through a mapping layer with cleaned tags. Yield calculations derive sawn timber volume from log volume. If you put all of that in the mart, it becomes one massive query that is impossible to debug: when the numbers look wrong, you have to understand the entire pipeline in one read.

The intermediate layer sits between staging and mart. It holds business logic: cross-source joins, derived calculations, enrichment. Each intermediate model is a checkpoint -- you can query it independently to verify its output before the mart aggregates everything.

Three layers, three purposes:

  • Staging -- source-conform. One model per source. No business logic.
  • Intermediate -- business logic. Cross-source joins, calculations, enrichment.
  • Mart -- consumer-facing. What Francoise queries. Aggregated, readable, final.

Step 2: Build the chain of custody intermediate model

This is the core model. It connects the three systems end-to-end: forestry log -> (cleaned mapping) -> sawmill batch -> customs shipment.

Create an intermediate model at models/intermediate/int_chain_of_custody.sql. This model joins:

  • the staging forestry model to the staging mapping model on cleaned log tags,
  • the mapping model to the staging sawmill model on batch numbers,
  • the sawmill model to the staging customs model on batch numbers.

For the customs join, you'll need to handle the comma-separated batch_numbers column -- UNNEST it first. Include a chain_of_custody_status column: 'complete' if a record has matches across all three systems, 'partial' if some links are missing.
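A minimal sketch of the shape this model could take, assuming a DuckDB warehouse and staging models named stg_forestry, stg_mapping, stg_sawmill, and stg_customs (your actual model and column names will differ):

```sql
-- int_chain_of_custody.sql -- sketch only; model and column names are assumptions
with customs_batches as (
    -- one row per (shipment, batch): explode the comma-separated column
    select
        shipment_id,
        trim(unnest(string_split(batch_numbers, ','))) as batch_number
    from {{ ref('stg_customs') }}
)

select
    f.log_tag_clean,
    m.batch_number,
    cb.shipment_id,
    case
        when f.log_tag_clean is not null
         and s.batch_number is not null
         and cb.shipment_id is not null then 'complete'
        else 'partial'
    end as chain_of_custody_status
from {{ ref('stg_mapping') }} m
left join {{ ref('stg_forestry') }} f
    on f.log_tag_clean = m.log_tag_clean      -- cleaned tag, not the raw one
left join {{ ref('stg_sawmill') }} s
    on s.batch_number = m.batch_number
left join customs_batches cb
    on cb.batch_number = s.batch_number
```

The LEFT JOINs matter: they keep partial chains visible in the output instead of silently dropping them, which is what makes the 'partial' status possible at all.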

Review the join logic carefully. Two things to check:

First, the forestry-to-mapping join uses the cleaned tags. Confirm that the join condition matches on the padded, trimmed tag -- not the raw tag from the source.

Second, the customs batch_numbers column is comma-separated. AI needs to UNNEST (or string_split) this into individual rows before joining. When AI uses UNNEST on semi-structured data, it commonly drops records where the value is empty or null. A customs shipment with an empty batch_numbers field would vanish from the output without any error. Check whether the UNNEST logic handles nulls.
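One way to avoid that silent drop -- assuming a staging model named stg_customs and DuckDB's unnest/string_split -- is to explode in a CTE and LEFT JOIN it back to the shipments, so a shipment with no parseable batch numbers survives with a null batch_number instead of vanishing:

```sql
-- Sketch: null-safe unnest; model and column names are assumptions
with exploded as (
    select
        shipment_id,
        trim(unnest(string_split(batch_numbers, ','))) as batch_number
    from {{ ref('stg_customs') }}
    where batch_numbers is not null and batch_numbers <> ''
)

select
    s.shipment_id,
    e.batch_number   -- null here means no batches: the chain is 'partial', not missing
from {{ ref('stg_customs') }} s
left join exploded e using (shipment_id)
```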

Step 3: Build the yield intermediate model

The yield model calculates sawmill efficiency: how much sawn timber comes out of the logs that go in.

Create an intermediate model at models/intermediate/int_yield.sql. Calculate yield as sawn_timber_out_m3 / log_volume_in_m3 per sawmill batch, per species, and per concession. Include the waste_percentage from the sawmill data.
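A sketch of the calculation, assuming staging models named stg_sawmill, stg_mapping, and stg_forestry, and assuming each batch maps to a single species and concession (verify both assumptions against your data):

```sql
-- int_yield.sql -- sketch only; names and grain are assumptions
with batch_inputs as (
    -- aggregate log volume up to the batch grain first, then divide once
    select
        m.batch_number,
        f.species,
        f.concession_id,
        sum(f.log_volume_m3) as log_volume_in_m3
    from {{ ref('stg_mapping') }} m
    join {{ ref('stg_forestry') }} f
        on f.log_tag_clean = m.log_tag_clean
    group by 1, 2, 3
)

select
    s.batch_number,
    b.species,
    b.concession_id,
    b.log_volume_in_m3,
    s.sawn_timber_out_m3,
    100.0 * s.sawn_timber_out_m3
        / nullif(b.log_volume_in_m3, 0) as yield_pct,   -- per-batch ratio
    s.waste_percentage
from {{ ref('stg_sawmill') }} s
join batch_inputs b using (batch_number)
```

The order of operations is the point: sum inputs to the batch grain, then divide. Averaging per-log ratios, or dividing totals taken across batches, produces a different number.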

Review the calculation. AI commonly generates business logic that looks mathematically correct but calculates differently from what the business means. Yield in the timber industry means the ratio of sawn output to raw input for a specific processing batch. If Claude calculates yield across batches or aggregates before dividing, the numbers will be wrong.

Cross-check a few rows manually: pick a sawmill batch, find its sawn_timber_out_m3 and the corresponding log_volume_in_m3 from the mapping and forestry data, and verify the calculation matches.
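A spot-check for a single batch might look like this (the batch id is a placeholder and the table names are assumptions; run it directly against the warehouse rather than through dbt):

```sql
-- Recompute one batch's yield by hand and compare with int_yield
select
    s.batch_number,
    s.sawn_timber_out_m3,
    sum(f.log_volume_m3) as log_volume_in_m3,
    100.0 * s.sawn_timber_out_m3
        / nullif(sum(f.log_volume_m3), 0) as yield_check
from stg_sawmill s
join stg_mapping m on m.batch_number = s.batch_number
join stg_forestry f on f.log_tag_clean = m.log_tag_clean
where s.batch_number = 'SB-0001'   -- placeholder: substitute a real batch id
group by 1, 2
```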

Step 4: Decide on materialization

Each intermediate model can be a view or a table. This is a design decision with real consequences.

A view re-executes its query every time a downstream model references it. The data is always current, but every downstream query pays the computation cost. A table stores results. Downstream queries are fast, but the data is stale until the table is rebuilt.

For this project, consider: the chain of custody model is referenced by the mart and may be queried directly for ad-hoc traceability checks. The yield model is referenced only by the mart. Which should be a table? Which can stay a view?

There is no single correct answer. The trade-off depends on how the pipeline is used. In your dbt_project.yml or in the model's config block, set the materialization for each intermediate model and be able to explain why.
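Either mechanism works; the inline form is a one-line config block at the top of the model file. For example, if you decide the chain of custody model should be a table:

```sql
-- Top of models/intermediate/int_chain_of_custody.sql
{{ config(materialized='table') }}   -- queried ad hoc, so pay the build cost once
```

A model with no config block uses the project default (dbt's out-of-the-box default is view), so check what your dbt_project.yml sets before assuming anything stayed a view.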

Step 5: Build the mart model

The mart is what Francoise's team queries. It should be readable, aggregated, and complete.

Create a mart model at models/marts/fct_shipments.sql. This model should provide: shipment-level chain of custody status, the concessions and species involved in each shipment, yield metrics, FLEGT status, and destination. Join from the intermediate chain of custody model and the intermediate yield model. This is what Francoise's team uses to answer "which shipments have complete traceability?"
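One possible shape for the mart, assuming DuckDB's list aggregate and assuming flegt_status, destination, species, and concession_id are carried through int_chain_of_custody (all column names are assumptions):

```sql
-- fct_shipments.sql -- sketch only; column names are assumptions
select
    coc.shipment_id,
    coc.chain_of_custody_status,
    coc.flegt_status,
    coc.destination,
    list(distinct coc.concession_id) as concessions,
    list(distinct coc.species)       as species_list,
    avg(y.yield_pct)                 as avg_yield_pct
from {{ ref('int_chain_of_custody') }} coc
left join {{ ref('int_yield') }} y
    on y.batch_number = coc.batch_number
group by 1, 2, 3, 4
```

Note that both ref() calls point at intermediate models only -- that is the layer rule this step depends on.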

The mart model should only reference intermediate models -- never staging models or raw sources. If Claude references a staging model directly in the mart, the layer separation breaks. Business logic would be split between the intermediate and mart layers with no clear boundary.

Review the mart output by querying a few rows. Each shipment should have a chain of custody status, the concessions it traces back to, and yield information. Compare the counts against materials/verification-checklist.md.

Step 6: Write custom business logic tests

In P3, dbt tests were structural: unique keys, not-null columns, accepted values. Those tests protect the pipeline's structure but do not verify that the business logic is correct.

This project needs custom business logic tests -- SQL assertions that encode what "correct" means for this specific pipeline.

Write custom dbt tests for the following business rules: (1) Every shipment with chain_of_custody_status = 'complete' must have matching records in all three upstream staging models. (2) Yield percentage must be between 0 and 100 for every batch. (3) No shipment in the mart should reference a concession_id that doesn't exist in the forestry staging model. Put these in tests/ as standalone SQL test files.
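As a concrete example, rule (2) can be a dbt singular test: a SELECT that returns the violating rows, saved as its own file under tests/. dbt fails the test if the query returns any rows. (Model and column names are assumptions.)

```sql
-- tests/assert_yield_between_0_and_100.sql
-- Passes when zero rows come back
select
    batch_number,
    yield_pct
from {{ ref('int_yield') }}
where yield_pct not between 0 and 100
   or yield_pct is null   -- a null yield also violates "for every batch"
```

Rules (1) and (3) follow the same pattern: select the rows that break the rule, and let an empty result mean pass.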

Review what Claude produces. AI consistently generates structural tests (unique, not_null) when asked for business logic tests. Check that the tests actually encode the business rules you asked for. A test that checks yield IS NOT NULL is a structural test. A test that checks yield BETWEEN 0 AND 100 is a business logic test. The difference matters.

Step 7: Run and verify

Run the full pipeline:

dbt build

All structural and business logic tests should pass. If a business logic test fails, that is a signal -- the model may be calculating something differently from what the business rule expects. Read the failure, trace it back through the intermediate layer, and determine whether the model or the test is wrong.

After the build succeeds, verify against the checklist. Query the mart for chain of custody statistics:

SELECT chain_of_custody_status, COUNT(*) 
FROM fct_shipments 
GROUP BY chain_of_custody_status;

The verification checklist expects approximately 160 complete chain of custody shipments out of 180 total. If the number is significantly different, trace back: check the intermediate chain of custody model, then the staging mapping model, then the quarantine table. The layer architecture makes this traceable -- each layer is a checkpoint.

Verify yield ranges by species:

SELECT species, MIN(yield_pct), MAX(yield_pct), AVG(yield_pct)
FROM int_yield
GROUP BY species;

Expected yield ranges are 35-65% depending on species. If any species falls outside that range, check the calculation in the intermediate model.


✓ Check

Check: dbt test passes. Chain of custody counts match checklist. Yield percentages within range.