Step 1: Load your context brief
Start a new session. Load your context brief from the last unit -- the data profiles, coding differences, and Fatimah's requirements.
This is proactive context design. Instead of re-explaining the project state to AI, you prepared a structured brief that gives AI the right starting point. Compare what AI knows with this brief loaded versus a cold start. The difference is the value of the infrastructure.
Step 2: Understand BigQuery's cost model
Before designing the schema, understand what makes BigQuery different from DuckDB. In BigQuery, every query scans data and is billed by the volume scanned. Because storage is columnar, a SELECT * on a 10GB table scans all 10GB even if you only need one column, while selecting just that column scans only its share of the bytes. A query that filters on a partitioned column scans only the relevant partitions. A query that doesn't filter scans everything.
This changes the economics of every design decision. In DuckDB, a full table scan costs nothing beyond local compute. In BigQuery, a full table scan on a large table costs real money. Partitioning by date means a query for "January deliveries" scans roughly 1/12th of a year's data. Clustering by factory means a query for "Factory 1" scans only the blocks containing Factory 1 records.
Partitioning and clustering aren't just performance optimizations. They're cost architecture.
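To make the pruning concrete, here is a sketch of a query against a hypothetical partitioned and clustered fact table -- the table and column names are assumptions, not the project's final schema:

```sql
-- Hypothetical table partitioned on delivery_date and clustered on
-- factory_id. The date filter prunes partitions and the factory filter
-- skips non-matching blocks, so only a fraction of the table's bytes
-- are scanned and billed.
SELECT factory_id, SUM(quantity) AS total_quantity
FROM analytics.fct_daily_deliveries
WHERE delivery_date BETWEEN '2024-01-01' AND '2024-01-31'  -- partition pruning
  AND factory_id = 'F1'                                    -- block pruning via clustering
GROUP BY factory_id
```

Remove the WHERE clause and the bytes-scanned estimate in the BigQuery console jumps to the full table size -- a quick way to confirm pruning is actually happening.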
Step 3: Design the staging layer
Open materials/project-spec-template.md. This is the empty pipeline specification you'll fill in as you design.
Start with the staging layer. You need staging models for each factory that:
- Read the raw exports (CSV for F1/F2, JSON for F3/F4)
- Resolve material codes to standard codes using the mapping table
- Normalize column names (delivery_date, material_code, quantity, customer_name, etc.)
- Preserve the factory_id for cross-factory analysis
Direct AI to design the staging models. AI commonly generates staging models that pass through source column names without standardizing them -- check that the output uses consistent naming across all four factories.
The naming convention: stg_factory1_deliveries, stg_factory2_deliveries, stg_factory3_production, stg_factory4_production.
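One staging model might look like the sketch below. The source names, raw column names, and mapping-table columns are assumptions to be verified against the actual exports and the material code mapping table:

```sql
-- models/staging/stg_factory1_deliveries.sql (sketch; names assumed)
with source as (

    select * from {{ source('factory1', 'deliveries_raw') }}

),

renamed as (

    select
        cast(s.delivery_date as date)                 as delivery_date,
        -- resolve factory-local codes to the standard code; fall back
        -- to the raw code so unmapped rows surface in testing
        coalesce(m.standard_code, s.material_code)    as material_code,
        cast(s.quantity as integer)                   as quantity,
        s.customer_name                               as customer_name,
        'F1'                                          as factory_id

    from source s
    left join {{ ref('material_code_mapping') }} m
        on s.material_code = m.source_code

)

select * from renamed
```

The same shape repeats for the other three factories, with only the source, the code-mapping join, and the factory_id literal changing.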
Step 4: Design partitioning and clustering
Now the financial decisions. Which tables get partitioned, on which columns, and why?
For fact tables (delivery and production data), partition by date. Fatimah's reports are time-bounded -- the CFO wants this month's numbers, last quarter's trends, year-over-year comparisons. Date partitioning means those queries scan only the relevant time range.
For the main fact table, cluster by factory_id and product_line. The CFO's queries filter by "Factory 1, concrete blocks" or "all factories, pipes." Clustering on those columns means BigQuery reads only the relevant data blocks.
Direct AI to add partition_by and cluster_by configuration to the dbt model definitions. AI commonly generates models without these config options -- on small development datasets the omission is invisible because everything is fast. On production-scale tables, every query becomes a full table scan.
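In dbt's BigQuery adapter, partitioning and clustering are set in the model's config block. A sketch, with column names assumed from the design above:

```sql
-- models/marts/fct_daily_deliveries.sql (config sketch)
{{ config(
    materialized='table',
    partition_by={
        'field': 'delivery_date',
        'data_type': 'date',
        'granularity': 'day'
    },
    cluster_by=['factory_id', 'product_line']
) }}

select * from {{ ref('int_deliveries_unified') }}
```

When reviewing AI output, check that every fact model has both keys present -- a missing `partition_by` won't cause any error, it will just silently produce an unpartitioned table.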
Step 5: Choose materialization strategy
Each model needs a materialization type. This is now a financial decision:
- Views: free to define, cost on query. Good for lightweight lookups where the underlying data is already partitioned.
- Tables: cost to build, free to query (data is stored). Good for frequently queried aggregations.
- Incremental: cost to update (only new/changed rows), free to query. Good for large fact tables where full rebuild is expensive.
For each model, decide: view, table, or incremental? For incremental models, decide the strategy: append (event data that only grows), merge (dimension data that updates), or delete+insert (replace a partition).
AI defaults to table materialization for everything. That works but misses the cost optimization. Direct AI through the trade-offs for each model.
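For the large fact tables, an incremental model with a partition-replacement strategy is one plausible shape. This is a sketch under assumed names; the 3-day lookback window is illustrative, not a project requirement:

```sql
-- Incremental sketch: insert_overwrite replaces only the affected
-- date partitions each run instead of rebuilding the whole table.
{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={'field': 'delivery_date', 'data_type': 'date'}
) }}

select *
from {{ ref('int_deliveries_unified') }}

{% if is_incremental() %}
-- only reprocess recent partitions; window size is an assumption
where delivery_date >= date_sub(current_date(), interval 3 day)
{% endif %}
```

Note the cost symmetry: the lookback filter bounds what the incremental run scans from the source, and `insert_overwrite` bounds what gets rewritten in the target.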
JT Thompson might have something to say about this -- check in with him about the cost implications of your materialization choices.
Step 6: Enforce schema contracts
The schema design should include enforced data types and constraints in schema.yml. This transforms the schema from documentation into a contract. A column documented as INTEGER that silently accepts strings from a source change will produce wrong calculations downstream. With enforcement, a type violation breaks the build before bad data reaches consumers.
Direct AI to create schema.yml with enforced types and constraints for each staging model. Verify that the types match what you saw in the data profiling -- integer quantities, decimal prices, date columns, string material codes.
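A minimal contract sketch for one staging model, with the column list assumed from the profiling work:

```yaml
# models/staging/schema.yml (sketch; columns assumed from profiling)
models:
  - name: stg_factory1_deliveries
    config:
      contract:
        enforced: true   # type mismatches now fail the build
    columns:
      - name: delivery_date
        data_type: date
        constraints:
          - type: not_null
      - name: material_code
        data_type: string
        constraints:
          - type: not_null
      - name: quantity
        data_type: integer
      - name: factory_id
        data_type: string
```

With `enforced: true`, dbt compares the model's actual output types against these declarations at build time, so a source change that turns quantities into strings breaks loudly instead of corrupting downstream math.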
Step 7: Update project memory
Update your CLAUDE.md with the BigQuery patterns you've established:
- Partitioning decisions (which tables, which columns, why)
- Clustering decisions
- Materialization strategy per model
- Naming conventions (stg_, int_, fct_, dim_)
- The material code mapping approach
This ensures every future AI session on this project starts with these design decisions already in context. The session-start infrastructure now reflects the project's current state, not just the initial setup.
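The entries can be terse -- decisions plus rationale, not prose. A sketch of what the new CLAUDE.md section might look like (wording illustrative):

```markdown
## BigQuery design decisions
- fct_daily_deliveries: partition by delivery_date (day), cluster by
  factory_id, product_line -- CFO queries are time-bounded and filter
  by factory/product.
- Materializations: staging = view; large facts = incremental
  (insert_overwrite by date partition); dims = table.
- Naming: stg_ / int_ / fct_ / dim_ prefixes.
- Material codes resolve to standard codes in staging via the mapping
  table -- never downstream.
```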
```mermaid
graph TD
    subgraph "Staging Layer"
        S1[stg_factory1_deliveries<br/>CSV, SAP codes]
        S2[stg_factory2_deliveries<br/>CSV, F2 codes]
        S3[stg_factory3_production<br/>JSON, PC-A codes]
        S4[stg_factory4_production<br/>JSON, PC-B codes]
    end
    subgraph "Intermediate Layer"
        I1[int_deliveries_unified]
        I2[int_production_unified]
    end
    subgraph "Mart Layer"
        F1[fct_daily_deliveries<br/>partition: date<br/>cluster: factory, product]
        F2[fct_cost_attribution<br/>partition: date<br/>cluster: factory, project]
        D1[dim_materials]
        D2[dim_factories]
    end
    S1 --> I1
    S2 --> I1
    S3 --> I2
    S4 --> I2
    I1 --> F1
    I2 --> F1
    I1 --> F2
    D1 --- F1
    D2 --- F1
```
Check: The schema design includes partitioning configuration for at least two fact tables and clustering configuration for at least one dimension, with explicit rationale for each choice tied to Fatimah's query patterns (cost attribution by factory, by product, by project).