Learn by Directing AI
All materials

CLAUDE.md

P8: BigQuery Cost Architecture + First MCP

Client

Fatimah Al-Rashidi, VP of Supply Chain at Al-Bina'a Building Materials Co. in Kuwait. Four factories producing concrete blocks, pipes, and precast elements. Supplies construction projects across Kuwait and exports to GCC neighbors. ~400 employees.

What you're building

A cloud analytics pipeline migrating Al-Bina'a's supply chain data from an on-premise database to BigQuery. The pipeline ingests data from four factories (different formats, different material coding schemes), transforms it through staging/intermediate/mart layers with cost-conscious design (partitioning, clustering, materialization strategy), implements RBAC for data governance, and provides the CFO with cost attribution by factory, product line, and project.

Tech stack

  • dbt Core with DuckDB adapter (local development) and BigQuery adapter patterns (cloud target)
  • DuckDB for local warehouse and development
  • BigQuery patterns: partitioning, clustering, INFORMATION_SCHEMA.JOBS, RBAC
  • Dagster for orchestration, scheduling, asset lineage
  • Soda Core for data quality monitoring and trend-based checks
  • MCP connections: DuckDB MCP (database access), Dagster MCP (pipeline observability)
  • Git/GitHub for version control

Data sources

Four factory exports with different formats and material coding schemes:

Factory Format Material codes Key fields
F1 CSV SAP (MAT-BLK-001) delivery_date, material_code, quantity_delivered, customer_name, billing_status
F2 CSV Internal (F2-101) ship_date, item_code, qty, buyer, billing_type, quality_inspector (PII)
F3 JSON Precast (PC-A-001) date, material, amount, client, billing
F4 JSON Precast (PC-B-001) date, material, amount, client, billing

Master mapping: material-mapping.xlsx reconciles all four coding schemes to standard codes. Three entries marked UNVERIFIED.

Schema design

Layer architecture

  • Staging: stg_factory1_deliveries, stg_factory2_deliveries, stg_factory3_production, stg_factory4_production -- source-conform, material codes resolved via mapping table
  • Intermediate: int_deliveries_unified, int_production_unified -- cross-factory joins
  • Mart: fct_daily_deliveries (partitioned by date, clustered by factory/product), fct_cost_attribution, dim_materials, dim_factories

Naming conventions

  • Staging: stg_[source]_[entity]
  • Intermediate: int_[entity]_[qualifier]
  • Fact: fct_[entity]
  • Dimension: dim_[entity]

Partitioning and clustering

  • Fact tables: partitioned by date (delivery_date/ship_date)
  • fct_daily_deliveries: clustered by factory_id, product_line
  • fct_cost_attribution: clustered by factory_id, project_code

Materialization strategy

  • Staging: view (lightweight, always current)
  • Intermediate: table (frequently joined)
  • Fact: incremental (large, append for events, merge for dimensions)
  • Dimension: table (small, full refresh)

Known data issues

  • Factory 2 quality_inspector column contains full names (PII) -- must be masked for non-quality roles
  • Material coding differs across all four factories -- mapping table required
  • Some mapping entries are UNVERIFIED -- flag in quality checks
  • Billing method varies (delivery vs milestone) -- affects when revenue appears
  • Correction records appear as duplicates with different values (no correction flag)

RBAC roles

Role Access
data_engineer Full access (debugging, development)
analyst Mart-layer only, masked PII
quality_team Quality data with inspector names visible
cfo_reporting Cost attribution views only

Quality testing strategy

  • Staging: schema drift detection, column presence, data types, value ranges, null patterns
  • Intermediate: join key correctness, material code resolution completeness
  • Mart: business logic validation (cost totals match source), freshness constraints

Monitoring

  • Business-outcome alerts (report freshness for CFO)
  • Cost alerts (daily BigQuery spend by model)
  • Trend-based monitoring (row count anomaly detection)
  • Backfill annotation mechanism

Verification targets

  • All four staging models resolve to standard material codes
  • Cost attribution totals match source sums
  • RBAC tested by querying as each role
  • Incremental models handle schema evolution after full refresh
  • Quality tests at all three layers (staging, intermediate, mart)

Commit convention

Commit after each unit of work. Messages describe what changed and why. Include the unit context in the message (e.g., "Add staging models with material code resolution").