Learn by Directing AI
Unit 4

Build the transformation layers and cost attribution

Step 1: Prepare a context brief

Before starting the transformation work, write a context brief for this session. The staging layer is complete -- four factory sources resolved to standard schemas with partitioning configured. Next come the intermediate and mart layers, where the cost decisions intensify.

Include in the brief: staging model names and their output schemas, the partitioning and clustering decisions from Unit 2, Fatimah's cost attribution requirements (by factory, product line, project), and any data quality observations from staging (UNVERIFIED mapping entries, null billing statuses).

Step 2: Build intermediate models

The intermediate layer unifies factory data across sources. Build two models:

  • int_deliveries_unified -- combines delivery data from all four staging models into a single view. Same standard material codes, same column names, source factory preserved.
  • int_production_unified -- combines production metrics.

Direct AI to handle the union of the four staging models. The key challenge: ensuring the cross-factory joins produce no orphaned records (deliveries without matching materials, production records without matching factories). Each staging model already resolved material codes -- the intermediate layer adds the joins that tie factories together.
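A sketch of what int_deliveries_unified could look like as a dbt model. The staging model names and column names here are placeholders, not the ones from your project -- substitute the names recorded in your context brief:

```sql
-- int_deliveries_unified.sql (sketch; staging names are placeholders)
{{ config(materialized='view') }}

with unioned as (

    select 'factory_a' as source_factory, delivery_id, delivery_date,
           material_code, project_code, quantity, unit_cost
    from {{ ref('stg_factory_a__deliveries') }}

    union all

    select 'factory_b', delivery_id, delivery_date,
           material_code, project_code, quantity, unit_cost
    from {{ ref('stg_factory_b__deliveries') }}

    -- ...union all the remaining two staging models the same way

)

select unioned.*
from unioned
-- An inner join silently drops orphaned deliveries; compare row counts
-- before and after the join to surface them instead of hiding them.
join {{ ref('stg_materials') }} using (material_code)
```

Counting rows in the unioned CTE versus the final select is a quick orphan check: any difference is deliveries whose material codes didn't resolve.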

The same SQL that ran instantly in DuckDB now has cost implications in BigQuery. Every intermediate model that materializes as a table stores data you're paying for. Every one that materializes as a view defers cost to query time. The execution context changed even though the SQL didn't.
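In dbt, that materialization decision is a single config line, which makes it cheap to revisit as query patterns emerge. A minimal illustration of the two options:

```sql
-- View: no storage cost; the full transformation runs (and is billed)
-- every time a downstream query touches it.
{{ config(materialized='view') }}

-- Table: pays storage plus a rebuild on each dbt run, but downstream
-- queries scan pre-computed results.
-- {{ config(materialized='table') }}
```

A reasonable default: views for intermediate models queried only by the marts, tables (or incremental models) for marts queried repeatedly.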

Step 3: Build mart models

The mart layer serves Fatimah's reporting needs. Build:

  • fct_daily_deliveries -- one row per delivery event, partitioned by date, clustered by factory and product line. The grain: one delivery to one project on one date.
  • fct_cost_attribution -- the CFO's report. Cost by factory, by product line, by project. Partitioned by date, clustered by factory and project code.
  • dim_materials -- one row per standard material code. Name, category, unit of measure.
  • dim_factories -- one row per factory. Location, capacity, production type.

For fact tables, choose the incremental strategy. fct_daily_deliveries grows over time -- new deliveries added daily. Append strategy works for event data that only grows. But if corrections come (and they will -- Fatimah's factories send corrections as re-exports), append produces duplicates. Merge handles corrections by updating existing records.

For dimensions, full table refresh. They're small and infrequently changing. The cost of rebuilding a 50-row dimension table is negligible.
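A sketch of fct_daily_deliveries with the partitioning and clustering from the spec above (column names are assumptions; the incremental strategy choice is the subject of the next step):

```sql
-- fct_daily_deliveries.sql (sketch; column names are assumed)
{{ config(
    materialized='incremental',
    partition_by={'field': 'delivery_date', 'data_type': 'date'},
    cluster_by=['factory', 'product_line']
) }}

select
    delivery_id,
    delivery_date,
    source_factory as factory,
    product_line,
    project_code,
    quantity,
    quantity * unit_cost as total_cost
from {{ ref('int_deliveries_unified') }}

{% if is_incremental() %}
  -- On incremental runs, read only rows newer than what's already loaded.
  where delivery_date > (select max(delivery_date) from {{ this }})
{% endif %}
```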

Step 4: Choose incremental strategies deliberately

AI defaults to append strategy for all incremental models. This works for pure event data, but for data that updates in place it silently accumulates duplicates, leaving stale records sitting alongside the corrected ones.

For each fact table, decide: does this data only grow (append), or do existing records get corrected (merge)? The factory exports include correction duplicates -- rows with the same record ID but different values. Append keeps both. Merge keeps the latest.

Walk through the trade-off with AI. Direct it to implement the chosen strategy for each model and verify the behavior by simulating a correction scenario.
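One way to direct that verification, assuming record IDs are stable across correction re-exports (a sketch, not the only approach): switch the model to merge, re-run against a simulated correction batch, then check for duplicates.

```sql
-- In the model config: merge dedupes on the unique key.
-- {{ config(
--     materialized='incremental',
--     incremental_strategy='merge',
--     unique_key='delivery_id'
-- ) }}

-- After loading a simulated correction re-export, this should return
-- zero rows under merge; under append it returns the corrected IDs.
select delivery_id, count(*) as copies
from {{ ref('fct_daily_deliveries') }}
group by delivery_id
having count(*) > 1
```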

Step 5: Set up cost attribution queries

Now the query that makes this project real for Fatimah. The cost attribution report needs to show:

  • Total cost by factory (which factory delivers the most value?)
  • Total cost by product line (blocks vs pipes vs precast)
  • Total cost by project (which construction projects consume the most materials?)

Direct AI to write these queries against the mart models. Then check: are the queries using the partitioned columns in their WHERE clauses? A query that filters by date on a partitioned, clustered table scans a fraction of what an unfiltered query would.
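A sketch of the by-factory query, with the partition filter that keeps the scan small (column names and the date range are assumptions):

```sql
-- Filtering on delivery_date lets BigQuery prune partitions;
-- clustering on factory further narrows the blocks read.
select
    factory,
    sum(total_cost) as total_cost
from fct_cost_attribution
where delivery_date between '2024-01-01' and '2024-03-31'
group by factory
order by total_cost desc
```

The by-product-line and by-project variants differ only in the GROUP BY column.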

Step 6: Analyze query costs

Query INFORMATION_SCHEMA.JOBS to see the cost of recent queries. In a real BigQuery environment, this shows bytes scanned and estimated cost per query. In the emulated local environment, the concept still applies -- DuckDB can report bytes processed.
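A starting point for the JOBS query. Note that BigQuery requires a region qualifier on INFORMATION_SCHEMA.JOBS, and the cost column assumes on-demand pricing, so verify the current rate:

```sql
-- Ten most expensive queries in the last day, by bytes scanned.
select
    job_id,
    query,
    total_bytes_processed,
    -- Assumes on-demand pricing (~$6.25 per TiB); verify current rates.
    total_bytes_processed / pow(1024, 4) * 6.25 as approx_cost_usd
from `region-us`.INFORMATION_SCHEMA.JOBS
where creation_time > timestamp_sub(current_timestamp(), interval 1 day)
order by total_bytes_processed desc
limit 10
```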

Identify the most expensive query. Why does it cost what it does? Is it scanning data it doesn't need? Would partitioning or clustering reduce the scan?

Compare two versions of the same query: one that filters on the partition column (date range) vs one that scans everything. The difference demonstrates why partitioning is cost architecture, not performance tuning.
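The comparison can be as simple as running both versions and reading bytes processed from each job's statistics:

```sql
-- Version A: partition filter present -- only matching partitions scanned.
select sum(total_cost)
from fct_cost_attribution
where delivery_date >= '2024-01-01';

-- Version B: no partition filter -- every partition scanned.
select sum(total_cost)
from fct_cost_attribution;
```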

Step 7: Update Fatimah

Send Fatimah an update on the cost attribution report structure. Explain what's available: cost breakdown by factory, product line, and project. Mention that the system is designed to control query costs through partitioning -- she wanted visibility into what BigQuery would cost.

She values precision and concrete deliverables. Frame the update in terms of what the CFO can now do with this data, not the technical implementation.

✓ Check

Query INFORMATION_SCHEMA.JOBS and identify the most expensive query. The student can explain why it costs what it does (bytes scanned, partitioning status) and what design change would reduce the cost.