Step 1: Understand Dagster's asset model
Until now, you have been running the pipeline manually: a Python extraction script, then dbt build. That works for development, but it means you are the orchestrator. You decide what runs, in what order, and what to do when something fails.
Dagster replaces you as the orchestrator. In Dagster, every piece of the pipeline is an asset -- a named, versioned piece of data that the pipeline produces. Each extraction is an asset. Each dbt model is an asset. Dependencies between assets declare what depends on what, and Dagster materializes assets in dependency order.
The key concept: when you define an asset, you declare its upstream dependencies. Dagster uses these declarations to build a lineage graph. If the forestry extraction fails, Dagster knows which downstream assets cannot be materialized and skips them -- rather than running them on stale or missing data.
Copy the Dagster scaffold from materials/dagster-scaffold/ into your project. The scaffold has workspace.yaml, pyproject.toml, definitions.py (an empty Definitions object), and an empty assets/ module.
Step 2: Define extraction assets
Each of the four extraction steps becomes a Dagster asset. Direct Claude to define them.
Define four Dagster extraction assets in assets/__init__.py: extract_forestry, extract_sawmill, extract_customs, and extract_tag_mapping. Each asset should run the extraction logic from our Python script (load CSV into DuckDB, record extraction metadata). Use the @asset decorator. The tag_mapping asset depends on extract_forestry (because tag cleaning validates against forestry tags). The customs asset depends on extract_sawmill (because it references sawmill batch numbers). Register all assets in definitions.py.
Review the dependency declarations. The order matters: the tag-to-batch mapping extraction needs the forestry data to validate tags against, so extract_tag_mapping should depend on extract_forestry. The customs extraction references sawmill batch numbers, so extract_customs should depend on extract_sawmill. The forestry and sawmill extractions are independent of each other and can run in parallel.
AI commonly generates Dagster assets with correct function bodies but incorrect dependency chains. The @asset decorator accepts a deps parameter that declares upstream dependencies. If Claude hardcodes table references inside the asset function but doesn't declare them in deps, Dagster won't know about the dependency. The lineage graph will show disconnected assets that look independent but actually aren't.
Step 3: Define dbt assets
Dagster's dbt integration wraps each dbt model as a Dagster asset automatically. You don't need to define one asset per model -- the integration reads your dbt project and creates assets from it.
Configure dagster-dbt integration in definitions.py. Point it at our dbt project directory. The integration should create Dagster assets for every dbt model (staging, intermediate, mart). The dbt assets should depend on the extraction assets so that Dagster materializes extractions before running dbt.
The integration connects two systems: Dagster's asset graph and dbt's model graph. After configuration, the Dagster UI will show both extraction assets and dbt model assets in a single lineage graph. The extraction assets feed into the dbt staging models, which feed into intermediate models, which feed into the mart.
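A sketch of the wiring, assuming a recent dagster-dbt (`DbtProject`, `@dbt_assets`, `DbtCliResource`) and a dbt project in a `dbt/` directory next to definitions.py; adjust the path to your layout. This is not runnable without a compiled dbt manifest.

```python
from pathlib import Path

from dagster import Definitions
from dagster_dbt import DbtCliResource, DbtProject, dbt_assets

# Hypothetical location -- point this at your actual dbt project.
dbt_project = DbtProject(project_dir=Path(__file__).parent / "dbt")
dbt_project.prepare_if_dev()  # compiles the manifest during local dev

@dbt_assets(manifest=dbt_project.manifest_path)
def timber_dbt_assets(context, dbt: DbtCliResource):
    # One Dagster asset per dbt model; dbt's logs stream back into
    # the per-asset view in the Dagster UI.
    yield from dbt.cli(["build"], context=context).stream()

defs = Definitions(
    assets=[timber_dbt_assets],  # plus the extraction assets
    resources={"dbt": DbtCliResource(project_dir=dbt_project)},
)
```

Connecting the dbt staging models to the extraction assets is typically done by mapping dbt sources onto the extraction asset keys (for example, via a custom DagsterDbtTranslator); the exact mapping depends on how your sources are named.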
Step 4: Verify dependency declarations
Before running anything, check that the lineage graph looks right. Start the Dagster development server:
dagster dev
Open the Dagster UI at http://localhost:3000. Navigate to the asset graph. The lineage should show:
- Extraction assets on the left: extract_forestry, extract_sawmill, extract_customs, extract_tag_mapping
- Staging dbt models in the middle: stg_forestry__logs, stg_sawmill__batches, stg_customs__shipments, stg_tag_batch__mapping
- Intermediate models: int_chain_of_custody, int_yield
- Mart model on the right: fct_shipments
Edges should connect extraction assets to the staging models they feed, and the flow should proceed left to right through the layers. If any assets appear disconnected -- floating with no edges -- the dependency declarations are wrong. Go back to the asset definitions and fix the deps parameter.
Step 5: Materialize the full pipeline
In the Dagster UI, trigger a full materialization of all assets. Dagster will run the extraction assets first (respecting their dependency order), then the dbt staging models, then intermediate, then mart.
Watch the materialization progress. Each asset turns green when it succeeds. If any asset fails, Dagster marks it red and skips everything downstream of it.
After the full materialization completes, all assets should be green. This is the operational baseline -- what "the pipeline is healthy" looks like.
Step 6: Explore the Dagster UI
Navigate the Dagster UI to understand what it provides as an operational tool.
Asset lineage: Click on any asset to see its upstream and downstream dependencies. From the mart, you can trace back through intermediate models, staging models, and extraction assets to the raw source. This is the debugging path: when mart numbers look wrong, the lineage tells you where to start investigating.
Run history: The Runs page shows every materialization run with timestamps, duration, and per-asset status. Over time, this builds a pattern -- which assets fail, when, and whether failures correlate.
Per-asset materialization: Click on an individual asset to see its materialization history, including when it was last materialized, how long it took, and whether it succeeded. This is where extraction metadata and Dagster metadata meet -- the extraction scripts record row counts, and Dagster records when the extraction ran.
Step 7: Simulate a failure
Introduce a deliberate failure to see how Dagster handles it. Temporarily break the forestry extraction asset -- for example, rename the source CSV file so the extraction cannot find it.
Trigger a full materialization again. The forestry extraction fails. Watch what happens downstream: every asset that depends on forestry data (the tag-to-batch mapping, the chain of custody intermediate model, the mart) should be skipped. Assets that don't depend on forestry (the sawmill extraction, the customs extraction) may still succeed.
This is failure propagation. If the forestry extraction silently returned zero records instead of failing, the downstream models would run on empty data and produce empty output -- not an error, just wrong results. An explicit failure with skipped downstream assets is better than a silent success with meaningless output.
After observing the failure, restore the source file and run a full materialization to return to the green baseline.
Check: Lineage graph shows correct flow. Full materialization succeeds. Simulated failure correctly skips downstream assets.