Learn by Directing AI
Unit 8

Schema evolution, final delivery, and close

Step 1: Simulate schema evolution

Factory 1 has added a new column to their export: delivery_priority (values: STANDARD, URGENT, CRITICAL). In production, this happens without warning. Source systems change. Exports gain columns, lose columns, rename columns.

Add a delivery_priority column to your local factory1-export.csv. Add values for a handful of rows -- STANDARD for most, a few URGENT, one CRITICAL. Then run the incremental model for fct_daily_deliveries.
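If you'd rather script the edit than hand-edit the file, here is a minimal sketch. The helper name and the `delivery_id`/`factory` columns are hypothetical stand-ins for whatever your export actually contains; only `delivery_priority` and its values come from this step.

```python
import csv
import io

def add_priority_column(csv_text: str, priorities: list[str]) -> str:
    """Append a delivery_priority column, one value per data row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    out = [header + ["delivery_priority"]]
    for row, priority in zip(data, priorities):
        out.append(row + [priority])
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(out)
    return buf.getvalue()

# STANDARD for most rows, a few URGENT, one CRITICAL
original = "delivery_id,factory\n1,F1\n2,F1\n3,F1\n"
print(add_priority_column(original, ["STANDARD", "URGENT", "CRITICAL"]))
```

Write the result back over factory1-export.csv, then run the incremental model.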

Check the output. Does the new column appear in the materialized table?

It should not. Incremental models, by default, only process new rows according to the incremental strategy -- in dbt, the on_schema_change config defaults to ignore, so the model never re-evaluates the schema. The model was built with the old schema, and it continues producing the old schema -- even though the source now has a new column. The data is there. The model ignores it.
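This ignore-by-default behavior is configurable. dbt's incremental materialization exposes an on_schema_change config; the sketch below uses the fct_daily_deliveries model from this unit, but the upstream ref and the delivery_date watermark column are assumptions about your project.

```sql
-- models/marts/fct_daily_deliveries.sql
{{
  config(
    materialized='incremental',
    on_schema_change='ignore'
    -- 'ignore' is the default: new source columns are silently dropped.
    -- Alternatives: 'append_new_columns', 'sync_all_columns', 'fail'.
  )
}}

select * from {{ ref('stg_factory1_deliveries') }}

{% if is_incremental() %}
where delivery_date > (select max(delivery_date) from {{ this }})
{% endif %}
```

Note that even 'sync_all_columns' only adjusts the schema going forward -- previously processed rows keep a NULL in the new column. A full refresh (Step 2) is still what backfills it.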

Step 2: Handle the schema evolution

The fix is an explicit full refresh. Direct AI to run a full refresh of the affected model:

dbt run --full-refresh --select fct_daily_deliveries

After the full refresh, query the table again. The delivery_priority column should now appear.
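One way to verify, assuming the fct_daily_deliveries table from this unit (prefix with your schema or dataset as needed):

```sql
-- After the full refresh this returns STANDARD/URGENT/CRITICAL counts;
-- before it, the column doesn't exist and the query errors.
select delivery_priority, count(*) as deliveries
from fct_daily_deliveries
group by delivery_priority
order by deliveries desc;
```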

This is the core behavior: schema changes in source data don't automatically propagate through incremental models. AI's default behavior -- running the incremental model without noticing the schema change -- produces a model whose schema silently drifts from its source. The model runs. No errors fire. The data is just incomplete.

In a production system, this is how a new column stays invisible for weeks -- until someone asks "why isn't delivery priority in the report?" and the answer is "it was always in the source, the model just never picked it up."

Step 3: Add schema evolution defense

Now that you've seen the failure mode, add a defense. Direct AI to implement a schema drift test at the staging layer:

The test should compare the columns in the source (the CSV/JSON file) against the columns the staging model expects. If the source has columns the model doesn't reference, flag a warning. If the source is missing columns the model depends on, flag an error.

This is the staging-layer defense from Unit 5 extended to schema evolution. The quality testing strategy now catches not just bad data but structural changes in the data.
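Because dbt itself can't read raw CSV headers, a drift check like this typically runs as a pre-run script or orchestration asset rather than a dbt test. A minimal sketch -- the function name and column sets are hypothetical:

```python
import csv
from pathlib import Path

def check_schema_drift(source_csv: Path, expected_columns: set[str],
                       required_columns: set[str]) -> dict:
    """Compare source file columns against what the staging model expects.

    warnings: columns present in the source that the model never references
    errors:   columns the model depends on that the source no longer provides
    """
    with open(source_csv, newline="") as f:
        source_columns = set(next(csv.reader(f)))
    return {
        "warnings": sorted(source_columns - expected_columns),
        "errors": sorted(required_columns - source_columns),
    }
```

Run it before `dbt run`; fail the pipeline on errors, log warnings for review. The delivery_priority scenario from Step 1 would surface here as a warning instead of staying invisible.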

Step 4: Handle Fatimah's scope creep

You've delivered the cost attribution report. Fatimah reviews it and responds:

This meets the requirement. The CFO can use this for the board meeting next week.

One more thing -- the procurement team also wants to query supplier data. Delivery times, quality reject rates by supplier. Can we bring that in too?

This is scope creep. It happens at exactly this moment in every project: the first successful delivery proves the system works, which generates new requests.

The procurement/supplier data is a new data domain. It doesn't fit the existing schema -- delivery times and reject rates by supplier are different from production deliveries and cost attribution. Bringing it in means new staging models, new intermediate joins, new mart tables, new tests, new RBAC considerations.

Direct AI to draft a response to Fatimah. The response should:

  1. Acknowledge the request and confirm the system could support it
  2. Scope what the work would involve (new data sources, new models, new testing)
  3. Recommend treating it as a Phase 2 project, not an extension of the current delivery

This is professional scope management. The answer is not "no" and the answer is not "sure, I'll add it." The answer is "yes, and here's what that involves."

Step 5: Prepare the final delivery

Assemble the deliverables Fatimah needs:

  1. Cost attribution report -- query results showing cost by factory, by product line, by project. The CFO's deliverable.
  2. Monitoring documentation -- what alerts are configured, what they watch, who gets notified. The operations handoff.
  3. RBAC documentation -- which roles exist, what each can access, the testing evidence. The governance record.
  4. Quality testing strategy -- what's tested, at what layer, what gaps remain. The maintainability guide.

Direct AI to compile these into a project README. The README is the documentation artifact -- a new engineer reading it should understand: what was built, why the cost architecture decisions were made, how RBAC is configured, what monitoring is in place, and what known limitations exist.

Step 6: Update project memory

The project memory (CLAUDE.md and any AGENTS.md) should accurately reflect the current state. Check: does the project memory reference BigQuery as the target system? Or does it still describe the DuckDB prototype?

If the memory references DuckDB as the production target, update it. The cross-platform context pattern matters here: the same project memory should work across AI tools (Claude Code, Codex CLI, Cursor). A memory file that says "this project uses DuckDB" when the production target is BigQuery will misguide any AI that reads it.

Verify:

  1. The CLAUDE.md accurately describes the BigQuery pipeline (not the DuckDB development environment)
  2. The technology stack references are correct
  3. The RBAC roles, monitoring strategy, and quality testing approach are documented
  4. Known limitations and the deferred procurement scope are noted
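A memory file that passes these checks might look like the sketch below. The section names and wording are illustrative, not a required format -- only the facts come from this project.

```markdown
# Project memory

## Stack
- Warehouse: BigQuery (production target). DuckDB was used only for prototyping.
- Transformation: dbt (incremental models; full refresh required after source schema changes)
- Orchestration: Dagster

## Governance
- RBAC: inspector names restricted to the quality team role
- Monitoring: staleness alerts on business-critical tables

## Known limitations
- Procurement/supplier data deferred to Phase 2 (pending scope approval)
```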

Step 7: Push and close

Push the project to GitHub. Include the README, all dbt models, the Dagster pipeline configuration, test files, and monitoring configuration.

Review the repository one final time. Everything Fatimah asked for is delivered: cost attribution by factory, product line, and project. Cloud costs are visible and controllable through partitioning and the INFORMATION_SCHEMA analysis. Inspector names are restricted to the quality team. Monitoring alerts when business-critical data goes stale.

What's deferred: procurement/supplier data integration (Phase 2, pending Fatimah's approval of scope and timeline).

✓ Check

Check: After simulating schema evolution (new column in Factory 1), confirm: (1) the incremental model initially ignores the new column, (2) after full refresh, the new column is present, (3) the project memory file accurately references BigQuery as the target, not DuckDB.

Project complete

Nice work. Ready for the next one?