The Brief
Francoise Mbeki runs timber export operations for Bois du Littoral out of Douala, Cameroon. The company manages four forest concessions, a sawmill, and export logistics to about 30 countries. EU buyers require FLEGT chain of custody documentation -- proof that every piece of timber in a shipment traces back to a legally managed concession.
The problem is three systems that don't talk to each other. The forestry inventory tracks log tags, species, and GPS coordinates by concession. The sawmill production database tracks processing batches, yield, and grades -- using its own batch numbers that have no connection to the forestry tags. The customs system tracks export permits and shipping manifests, referencing sawmill batch numbers.
When an EU buyer asks for chain of custody documents, Francoise's team manually traces records across all three systems. It takes a week per shipment. They export 40 times a year. The Antwerp buyer is already threatening penalties.
Your Role
You're building the pipeline that connects these three systems -- a chain of custody from forest concession to export port. The pipeline extracts data from three sources into DuckDB, resolves identity across systems that use different identification schemes, and produces a mart that Francoise's team can query for any shipment's traceability status.
This is the first project where the pipeline has enough moving parts to need orchestration. You'll use Dagster to coordinate extraction and transformation, and its asset lineage to trace problems when numbers look wrong.
You'll direct Claude Code through multi-source work. When you ask it to join data across sources, it will match on column names. Whether those names mean the same thing across systems is your judgment call.
What's New
Last time you built a dbt project from three data sources for Mihai's cheese operation. The sources were straightforward CSV files that loaded into staging models with naming conventions you applied.
This time, the sources have different identification schemes. The forestry system uses log tags. The sawmill uses batch numbers. No field directly connects them. A digitized logbook bridges the gap -- but the logbook has inconsistencies. Records that can't be matched don't just disappear into wrong counts. They go to an error quarantine table so you know exactly what's missing and why.
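The matching-with-quarantine step can be sketched in plain Python. The record shapes, field names, and rejection reasons below are illustrative assumptions, not the project's actual schema; in the pipeline itself this logic lives in dbt models.

```python
# Hypothetical rows: the digitized logbook maps forestry log tags to
# sawmill batch numbers. Each row either matches cleanly or lands in
# the quarantine list with an explicit reason.
logbook = [
    {"log_tag": "CMR-001", "batch_no": "B-100"},
    {"log_tag": "CMR-002", "batch_no": None},     # logbook entry missing a batch
    {"log_tag": "CMR-999", "batch_no": "B-777"},  # tag unknown to the forestry system
]
forestry_tags = {"CMR-001", "CMR-002"}   # tags known to the forestry inventory
sawmill_batches = {"B-100", "B-200"}     # batches known to the sawmill database

matched, quarantine = [], []
for row in logbook:
    if row["log_tag"] not in forestry_tags:
        quarantine.append({**row, "reason": "unknown_log_tag"})
    elif row["batch_no"] is None:
        quarantine.append({**row, "reason": "missing_batch_no"})
    elif row["batch_no"] not in sawmill_batches:
        quarantine.append({**row, "reason": "unknown_batch_no"})
    else:
        matched.append(row)
```

Here one row matches and two are quarantined, each with a reason -- which is the point: nothing silently vanishes from the counts.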
The transformation layer gets deeper too. In P3, you went from staging to mart. Here you'll add an intermediate layer for the business logic that's too complex to put in a mart query -- the cross-source chain of custody assembly, yield calculations, traceability status. And Dagster enters for the first time, turning your pipeline from something you run manually into something with dependency tracking, lineage, and failure propagation.
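As one example of intermediate-layer logic, a yield calculation divides sawn output volume by log input volume per batch. This is a minimal sketch with made-up numbers and a hypothetical plausibility range; the real thresholds come from the verification checklist.

```python
def batch_yield(input_m3: float, output_m3: float) -> float:
    """Sawmill recovery rate: sawn output volume / log input volume."""
    if input_m3 <= 0:
        raise ValueError("input volume must be positive")
    return output_m3 / input_m3

# A recovery rate of 0.45 means 45% of the log volume became sawn timber.
# Values outside an assumed plausible band (say 0.25-0.65) would flag a
# data problem rather than a remarkable sawmill.
rate = batch_yield(10.0, 4.5)
print(round(rate, 2))  # 0.45
```

Checks like this are why yield belongs in an intermediate model: the mart can then report traceability status without re-deriving the arithmetic.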
Tools
- Python -- via your Miniconda environment
- DuckDB -- analytical database for all sources
- SQL -- transformation logic inside dbt models
- dbt Core + dbt-duckdb adapter -- transformation framework
- Dagster -- pipeline orchestration (new this project). The unit that introduces Dagster walks through asset definitions and the UI.
- Claude Code -- your AI agent
- Git / GitHub -- version control
Materials
You'll receive:
- Pipeline spec template -- requirements are filled in, schema design sections are yours to complete after profiling
- Sample data from each source -- forestry, sawmill, customs, and the tag-to-batch mapping logbook
- Full datasets -- the complete data for each source
- dbt project scaffold -- directory structure and configuration, no models
- Dagster project scaffold -- basic structure, no assets defined
- Verification checklist -- row counts, match rates, chain of custody statistics, yield ranges
- Project governance file -- CLAUDE.md with data dictionary, naming conventions, and ticket backlog