Step 1: Understand MCP connections
Until now, every time you needed AI to know about your database, you described it: "the orders table has columns order_id, customer_id, order_date." AI worked from your description. If your description was incomplete or inaccurate, AI's output inherited those gaps.
MCP (Model Context Protocol) changes that. When you connect AI to a database via MCP, AI reads the schema directly. It explores tables, verifies column types and names, runs queries against live data. The shift from "describe the data to AI" to "AI reads it directly" is qualitative, not just convenient.
The same protocol works across AI coding tools. The DuckDB MCP server you're about to connect works in Claude Code, Codex CLI, and Cursor -- same protocol, different tool. The standard itself is the transferable knowledge.
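As a concrete sketch of that shared standard, here is what a project-level `.mcp.json` that Claude Code reads might look like. The server package name, launcher, and database path are assumptions -- substitute whatever your DuckDB MCP server's documentation specifies:

```json
{
  "mcpServers": {
    "duckdb": {
      "command": "uvx",
      "args": ["mcp-server-duckdb", "--db-path", "analytics.duckdb"]
    }
  }
}
```

Other MCP-aware tools consume an equivalent server entry. The server process and the protocol stay the same; only the registration file differs per tool.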
Step 2: Connect the DuckDB MCP server
Connect the DuckDB MCP server to Claude Code. This gives AI direct access to your DuckDB database -- it can read table schemas, run queries, and verify data without you describing anything.
After connecting, test the capability shift. Ask AI about a table that exists in your database. Compare the response quality to what you'd get without the connection -- AI should reference actual column names, data types, and sample values it read directly.
Then ask about a table that doesn't exist. AI should tell you it's not there rather than guessing.
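One way to register the server is from the command line. This is a sketch assuming a DuckDB MCP server installable via uvx -- the server name, package, and database path are illustrative, so check your server's docs for the exact invocation:

```shell
# Register a DuckDB MCP server with Claude Code
# (package name and --db-path flag are assumptions -- adjust to your server)
claude mcp add duckdb -- uvx mcp-server-duckdb --db-path analytics.duckdb

# Confirm the server is registered and reachable
claude mcp list
```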
Step 3: Connect the Dagster MCP server
Connect the Dagster MCP server. This gives AI direct access to pipeline status, asset lineage, and run history.
After connecting, ask AI about your pipeline's current state. Which assets exist? Which have been materialized? What's the status of the last run?
This is pipeline observability through AI. Instead of navigating the Dagster UI yourself, you can ask natural language questions and AI reads the answers from Dagster directly. The same protocol applies -- Dagster MCP works across AI tools.
Step 4: Build staging models for Factory 1 and 2
With MCP connected, build the staging models. Start with Factory 1 (stg_factory1_deliveries):
Direct AI to create a dbt staging model that:
- Reads from factory1-export.csv
- Joins to the material mapping table to resolve SAP codes to standard codes
- Standardizes column names (delivery_date, standard_material_code, quantity, customer_name, factory_id)
- Adds partitioning and clustering configuration
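A sketch of what AI should produce for that spec. The source and mapping-table names, join key, and config syntax are assumptions (the partition/cluster config shown is dbt-BigQuery style -- adapt it to your warehouse adapter):

```sql
-- models/staging/stg_factory1_deliveries.sql (sketch, not a definitive implementation)
{{ config(
    materialized='table',
    partition_by={'field': 'delivery_date', 'data_type': 'date'},
    cluster_by=['standard_material_code']
) }}

select
    cast(f.delivery_date as date) as delivery_date,
    m.standard_material_code,
    f.quantity,
    f.customer_name,
    'factory1'                    as factory_id
from {{ source('raw', 'factory1_export') }} as f
left join {{ ref('material_mapping') }} as m
    on f.material_code = m.sap_code
```

The shape to verify: factory-specific columns renamed once at the staging boundary, and the mapping join keyed on the factory's own code system (SAP codes here).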
AI can now read the actual CSV schema via MCP instead of relying on your description from the profiling step. The model quality should reflect this -- correct column references, accurate types.
Then build stg_factory2_deliveries for Factory 2. Different source columns (ship_date, item_code, qty, buyer) mapping to the same standard output schema. The material mapping joins on F2 internal codes instead of SAP codes.
Watch for AI omitting partition_by and cluster_by in the dbt model config. On development-size data this omission is invisible. On production tables, it's the difference between scanning one partition and scanning everything.
Step 5: Build staging models for Factory 3 and 4
Build stg_factory3_production and stg_factory4_production. These read JSON instead of CSV, use PC-A/PC-B material codes, and have yet another set of field names (date, material, amount, rate, client).
The output schema should match Factory 1 and 2's staging models exactly. Same column names, same types, same material code standard. Four different inputs, one standard output.
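A sketch of the Factory 3 model, assuming DuckDB's read_json_auto for ingestion. The file path, mapping-table name, and join column are assumptions -- adjust to your project:

```sql
-- models/staging/stg_factory3_production.sql (sketch; path and join key assumed)
select
    cast(j.date as date) as delivery_date,
    m.standard_material_code,
    j.amount             as quantity,
    j.client             as customer_name,
    'factory3'           as factory_id
from read_json_auto('data/factory3-production.json') as j
left join {{ ref('material_mapping') }} as m
    on j.material = m.pc_code   -- PC-A/PC-B codes, not SAP codes
```

The select list is where the standardization happens: different input names on the right, the one standard schema on the left.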
Run dbt build and verify all four staging models compile and produce data. Check row counts against what you saw in the profiling step.
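A minimal sketch of that check from the shell, assuming the staging models live under models/staging and the DuckDB file is named analytics.duckdb (both assumptions):

```shell
# Build only the staging models (path selector -- adjust to your layout)
dbt build --select models/staging

# Spot-check a row count against what the profiling step reported
duckdb analytics.duckdb -c "select count(*) from stg_factory1_deliveries;"
```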
Step 6: Verify material code resolution
This is the critical check. Query each staging model and verify the material codes are standard, not factory-specific:
SELECT COUNT(DISTINCT standard_material_code) AS unique_materials,
       COUNT(*) AS total_rows
FROM stg_factory1_deliveries;
Run the same query on all four staging models. The standard_material_code column should contain codes like STD-BLK-001, not MAT-BLK-001 or F2-101 or PC-A-001.
Also check: did the UNVERIFIED mapping entries cause any issues? How many rows in each staging model have standard codes that came from UNVERIFIED mappings? This is a data quality signal you'll want to test for in Unit 5.
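A sketch of that check, assuming the mapping table carries a status column that flags UNVERIFIED entries (the column and table names here are assumptions -- adjust to your mapping table's actual schema):

```sql
-- Count staging rows whose standard code was resolved via an UNVERIFIED mapping
select count(*) as rows_from_unverified_mappings
from stg_factory1_deliveries as s
join material_mapping as m
  on s.standard_material_code = m.standard_material_code
where m.status = 'UNVERIFIED';
```

Run it once per staging model; a nonzero count is the data quality signal to carry into the Unit 5 tests.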
Check: All four staging models resolve material codes to a single standard. Running SELECT COUNT(DISTINCT standard_material_code) FROM stg_factory1_deliveries returns standard codes, not factory-specific codes.