Learn by Directing AI
Unit 3

Initialize the dbt project and configure sources

Step 1: Review the dbt project template

Before you initialize anything, look at what the template provides. Open materials/dbt-template/ and examine the files inside.

The template has three configuration files and two empty directories:

  • dbt_project.yml -- the project configuration file. It names the project branzeria_carpati and tells dbt where to find models, tests, and seeds.
  • profiles.yml -- the connection configuration. It points dbt at a local DuckDB file (branzeria_carpati.duckdb) using the duckdb adapter. This is already configured. You do not need to edit it.
  • models/schema.yml -- currently empty. This is where you will define sources and, later, tests.
  • models/staging/ -- empty directory, waiting for staging models.
  • models/marts/ -- empty directory, waiting for mart models.

This template is scaffolding. You did not have to decide the project name, the adapter, or the directory layout. Those decisions were made for you. In later projects that scaffolding disappears -- you make those choices yourself. For now, the template lets you focus on sources and models without configuring the framework from scratch.
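For orientation, a dbt profile for the duckdb adapter typically has the shape sketched below. This is an illustration of the usual layout, not a copy of the template's file -- the exact values live in the provided profiles.yml, which you do not need to edit:

```yaml
# Sketch of a dbt-duckdb profile (illustrative, not the template's exact contents).
# The top-level key must match the `profile:` entry in dbt_project.yml.
branzeria_carpati:
  target: dev
  outputs:
    dev:
      type: duckdb                     # the dbt-duckdb adapter
      path: branzeria_carpati.duckdb   # local DuckDB database file
```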

Step 2: What dbt is (and what it is not)

dbt is a transformation framework. It takes the SQL you already know -- the same SELECT statements, JOINs, and aggregations from P1 and P2 -- and wraps them in a structure that tracks dependencies, supports naming conventions, and runs automated tests.

A dbt project is a collection of .sql files called models. Each model is a SELECT statement that produces one table or view. The framework adds three things raw SQL scripts lack:

  1. Dependency tracking. dbt knows which models depend on which other models. It builds them in the right order automatically. You do not write a script that says "run staging first, then marts." dbt figures that out from the dependencies you declare.
  2. Naming conventions. Models live in directories (staging/, marts/) and follow naming patterns (stg_production, fct_variety_profitability). The structure makes the pipeline navigable -- anyone opening the project can see what each layer does.
  3. Automated tests. You declare expectations about the data (this column should be unique, that column should never be null) and dbt checks them after every run.

The SQL inside each model is identical to what you wrote in P2. dbt does not replace SQL -- it organizes it.
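To make that concrete: a model is just a .sql file containing a SELECT. A hypothetical staging model might look like the sketch below -- the file name and column names are illustrative, not taken from the actual dataset, and the source() call is explained in Steps 4 and 5:

```sql
-- models/staging/stg_sales.sql (sketch; column names are assumptions)
-- dbt materializes this SELECT as a view or table named stg_sales.
select
    sale_id,
    sale_date,
    variety,
    quantity_kg,
    revenue
from {{ source('raw', 'sales') }}
```

Nothing in the body is dbt-specific except the source() call; the rest is the SQL you already write.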

Step 3: Initialize the dbt project from the template

This maps to T2 in the ticket backlog from materials/CLAUDE.md: "Initialize dbt project from template and configure source definitions." Direct Claude to copy the template into your working directory and verify the connection:

Copy the dbt-template directory from materials/ into the project root as branzeria_carpati. Then run dbt debug from inside the branzeria_carpati directory to verify the DuckDB connection works.

Claude copies the template files, creating the branzeria_carpati/ directory with the full project structure. Then dbt debug runs a series of checks: it validates dbt_project.yml, confirms profiles.yml is readable, and tests the database connection.

The output shows each check with a pass/fail status. The final line should read "All checks passed!" -- that confirms the project skeleton is in place and dbt can reach DuckDB.

If dbt debug fails, the most common cause is a path issue. This project keeps profiles.yml inside branzeria_carpati/, and dbt looks for that file in the current working directory before falling back to ~/.dbt/, so the command must run from inside branzeria_carpati/, not from the project root. Make sure Claude runs the command from the right location.

Step 4: Configure source definitions

The dbt project is connected, but it does not know about your raw data yet. The three tables you loaded in Unit 2 -- production_log, sales, milk_purchases -- exist in DuckDB, but dbt has no reference to them. Source definitions create that link.

Open branzeria_carpati/models/schema.yml. Direct Claude to add source definitions:

In the schema.yml file, define a dbt source called 'raw' with three tables: production_log, sales, and milk_purchases. These are the DuckDB tables we loaded in Unit 2.

The schema.yml should look something like this after Claude writes it:

version: 2

sources:
  - name: raw
    schema: main
    tables:
      - name: production_log
      - name: sales
      - name: milk_purchases

Source definitions are a declaration: "these raw tables exist, and this is how dbt should refer to them." When you write a staging model later, you will use source('raw', 'production_log') instead of referencing the table name directly. That indirection matters -- if the source table name changes, you update the definition in one place instead of editing every model that reads from it.

Check that the table names in schema.yml match exactly what you created in DuckDB. If you named the table production_log in Unit 2 but the source definition says production_logs (plural), dbt will fail when it tries to read from a table that does not exist. AI commonly generates source definitions with default or assumed table names rather than matching the actual tables in the database -- verify the names before moving on.
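One way to catch a name mismatch early is a small script that compares the names declared in schema.yml against the tables that actually exist. The sketch below is stdlib-only and self-contained: it inlines the YAML and hardcodes the expected table list. In a real check you would read the file from disk and fetch the table list from DuckDB (e.g. via the duckdb Python package) instead:

```python
import re

# Contents of schema.yml, inlined here for illustration.
schema_yml = """
version: 2

sources:
  - name: raw
    schema: main
    tables:
      - name: production_log
      - name: sales
      - name: milk_purchases
"""

# Tables that actually exist in DuckDB. In practice, query the database
# for this list rather than hardcoding it.
actual_tables = {"production_log", "sales", "milk_purchases"}

# Pull every `- name:` entry, then drop the source name itself.
declared = set(re.findall(r"-\s*name:\s*(\w+)", schema_yml))
declared.discard("raw")

missing = declared - actual_tables
print("declared but missing from database:", sorted(missing))  # [] means names match
```

If the source definition said production_logs (plural), it would show up in the missing set immediately, before any dbt run fails.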

Step 5: Understand ref() and source() -- the DAG

Two functions define how dbt models connect to each other and to raw data.

source() reads from a raw table. When you write source('raw', 'production_log') in a staging model, you are telling dbt: "this model reads from the production_log table defined in our source configuration." dbt uses this to track where your data comes from.

ref() reads from another model. When you later write ref('stg_production') in a mart model, you are telling dbt: "this model depends on the stg_production staging model. Build that one first." dbt uses this to determine build order.
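Side by side, a mart model using ref() might look like this sketch -- the join logic and column names are assumptions for illustration, not the project's actual mart:

```sql
-- models/marts/fct_variety_profitability.sql (sketch; columns are assumptions)
-- ref() tells dbt to build stg_sales before this model.
select
    s.variety,
    sum(s.revenue) as total_revenue
from {{ ref('stg_sales') }} as s
group by s.variety
```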

Together, these two functions build the DAG -- the directed acyclic graph. The DAG is a map of your entire pipeline: which models read from which sources, which models depend on which other models, and in what order everything needs to run. In P1 and P2, you managed build order manually -- "run the staging script before the mart script." dbt eliminates that manual coordination. You declare the dependencies, and the framework handles the rest.

The DAG for this project will look like this once all models are built:

graph LR
    PL[production_log] --> SP[stg_production]
    S[sales] --> SS[stg_sales]
    MP[milk_purchases] --> SPR[stg_purchases]
    SP --> FVP[fct_variety_profitability]
    SS --> FVP
    SPR --> FVP

Three raw sources feed three staging models. All three staging models feed one mart. dbt reads the source() and ref() calls in your SQL and builds this graph automatically. When you run dbt run, it executes the staging models first (they have no model dependencies, only source dependencies) and the mart last (it depends on all three staging models).

This is what ref() actually solves. In P2, if you ran the mart script before the staging scripts, it would either fail or produce wrong results. With dbt, that mistake is impossible -- the framework enforces the order.
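The build-order logic can be sketched in a few lines with Python's standard-library graphlib: declare the same edges as the DAG above, and a topological sort produces an order where every staging model comes before the mart. This illustrates the idea, not dbt's actual scheduler:

```python
from graphlib import TopologicalSorter

# Edges from the DAG above: each node maps to the set of nodes it depends on.
# Raw sources (production_log, sales, milk_purchases) have no dependencies.
deps = {
    "stg_production": {"production_log"},
    "stg_sales": {"sales"},
    "stg_purchases": {"milk_purchases"},
    "fct_variety_profitability": {"stg_production", "stg_sales", "stg_purchases"},
}

# static_order() yields dependencies before the nodes that depend on them.
order = list(TopologicalSorter(deps).static_order())
print(order)

# The mart comes last; every staging model precedes it.
mart_pos = order.index("fct_variety_profitability")
assert all(order.index(m) < mart_pos
           for m in ("stg_production", "stg_sales", "stg_purchases"))
```

Running the mart "too early" is not a mistake you can make here: the sorter (like dbt) derives a valid order from the declared edges instead of trusting you to remember one.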


✓ Check

dbt debug passes. Source definitions exist for all three raw tables. The student can explain: what does ref() do? What does source() do? Why does dbt need to know about dependencies?