Learn by Directing AI
Unit 1

Meet Francoise and understand the timber business

Step 1: Read Francoise's forwarded email

Francoise Mbeki runs operations for Bois du Littoral -- a timber export company in Douala, Cameroon. Four forest concessions, a sawmill, and export logistics to about 30 countries. She has forwarded you an internal email chain that captures the problem.

The Antwerp buyer is threatening penalties. Chain of custody documents for the March shipment are overdue. Jean-Pierre, her colleague, has been trying to match sawmill records to forestry tags manually. Three systems, no connection. This is what they deal with on every shipment.

EU FLEGT regulations require proof that every piece of timber traces back to a legally managed concession. Francoise exports 40 times a year, and the manual process takes a week per shipment -- roughly 40 weeks of documentation work annually. The math does not work.

Step 2: Message Francoise

Open the chat with Francoise. Pick one of the suggested messages to introduce yourself.

Francoise responds directly and without excess. She confirms the three systems: forestry inventory, sawmill production, customs/export. She mentions the regulatory pressure from the EU -- competitors are being shut out of the European market for non-compliance. If you ask how logs are tracked from the forest to the sawmill, she reveals a critical detail about a paper logbook at the sawmill gate. If you don't ask, she won't mention it.

Francoise's feedback is blunt. "This is acceptable" is high praise. "This doesn't work" is standard correction. She expects deadlines and doesn't soften. Her English is excellent but formal -- French business style.

Step 3: Review the pipeline spec template

Open materials/pipeline-spec-template.md. This defines what you're building and what "correct" looks like.

The requirements section is complete: chain of custody, FLEGT documentation, inventory view, yield tracking, gap detection. The schema design and layer architecture sections are empty -- you'll fill those after profiling the data.

This is different from P3, where the pipeline spec came complete. Here, the requirements are clear but the design is yours. How you structure the schema and layers depends on what you find in the data.

Read the verification targets at the bottom -- they reference materials/verification-checklist.md. Those are the numbers you'll check your pipeline against when the work is done.

Step 4: Explore the forestry data

Open materials/forestry-sample.csv. This is a 12-row sample from the full forestry dataset.

The columns: concession_id, concession_name, log_tag, species, gps_lat, gps_lon, harvest_date, harvest_permit_number, volume_m3, harvest_team. Each row is one harvested log from one of four concessions.

Notice the log_tag column. Every log gets a tag painted on the end at the forest -- a four-digit number with a leading zero: 0247, 0248, 0249. This is how the forestry system identifies individual logs.

Also notice concession_id. There are four concessions: C1 through C4. The same tag number could appear in different concessions -- tags are unique within a concession, not globally. That distinction matters.
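The consequence of per-concession uniqueness is that `log_tag` alone cannot serve as a primary key. A minimal sketch (the rows below are hypothetical, shaped like materials/forestry-sample.csv):

```python
# Why (concession_id, log_tag) must be the key, not log_tag alone.
# Hypothetical rows in the shape of materials/forestry-sample.csv.
forestry_rows = [
    {"concession_id": "C1", "log_tag": "0247"},
    {"concession_id": "C2", "log_tag": "0247"},  # same tag, different concession
    {"concession_id": "C1", "log_tag": "0248"},
]

tags_alone = {row["log_tag"] for row in forestry_rows}
composite_keys = {(row["concession_id"], row["log_tag"]) for row in forestry_rows}

# 3 rows collapse to 2 distinct tags, but stay 3 distinct composite keys.
print(len(forestry_rows), len(tags_alone), len(composite_keys))
```

If your schema keys logs on `log_tag` alone, two logs from different concessions silently merge into one record.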

Step 5: Explore the sawmill data

Open materials/sawmill-sample.csv. Twelve rows from the production database.

Columns: batch_number, processing_date, log_intake_count, log_volume_in_m3, sawn_timber_out_m3, waste_percentage, species, grade. Each row is one processing batch at the sawmill.

Now look for a forestry log tag anywhere in this data. It is not there.

The sawmill assigns its own batch numbers (format SB-2024-001) when logs arrive for processing. There is no field connecting a sawmill batch to the forestry tags of the logs inside it. The forestry system and the sawmill system use completely different identification schemes for the same physical timber.

This is the identity resolution challenge at the center of this project. Three systems describe the same timber as it moves through the supply chain -- and they don't share a common identifier.
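You can see the gap directly by comparing the two schemas. The column lists below come straight from the samples; the only shared column is descriptive, not an identifier:

```python
# The forestry and sawmill schemas share no identifier column.
# Column names are taken from the two sample CSVs described above.
forestry_cols = {"concession_id", "concession_name", "log_tag", "species",
                 "gps_lat", "gps_lon", "harvest_date", "harvest_permit_number",
                 "volume_m3", "harvest_team"}
sawmill_cols = {"batch_number", "processing_date", "log_intake_count",
                "log_volume_in_m3", "sawn_timber_out_m3", "waste_percentage",
                "species", "grade"}

print(forestry_cols & sawmill_cols)  # only {'species'} -- not an identifier
```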

Step 6: Explore the tag-to-batch mapping

Open materials/tag-batch-mapping-sample.csv. This is the bridge.

At the sawmill gate, workers record which forestry log tags entered which sawmill batch in a paper logbook. That logbook has been digitized into a CSV: forestry_log_tag, sawmill_batch_number, entry_date, recorded_by.

Now look closely at the forestry_log_tag column. Some tags have leading zeros: "0247". Some don't: "247". Some have trailing whitespace: "0248 ". The logbook was digitized manually. Nobody validated the tag format during data entry.

A forestry tag "0247" and a mapping entry "247" refer to the same physical log. But a string match won't connect them. If the cleaning doesn't handle leading zeros and whitespace, those records silently fail to match -- and the chain of custody for any shipment containing those logs has a gap. No error message. Just a missing link.
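A normalization step can make "0247", "247", and "0248 " match their forestry tags. A minimal sketch -- the function name is ours, and the pad width of 4 is assumed from the sample tags like 0247:

```python
def normalize_tag(raw: str) -> str:
    """Normalize a digitized logbook tag: strip stray whitespace,
    then restore the four-digit zero-padded form the forestry
    system uses (pad width assumed from sample tags like 0247)."""
    return raw.strip().zfill(4)

# All three digitized variants resolve to their forestry tags:
assert normalize_tag("0247") == "0247"   # already clean
assert normalize_tag("247") == "0247"    # leading zero dropped at data entry
assert normalize_tag("0248 ") == "0248"  # trailing whitespace
```

Note this only repairs the formats named in the sample; profiling the full dataset may surface other entry errors this sketch doesn't cover.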

Step 7: Explore the customs data

Open materials/customs-sample.csv. Twelve rows from the export system.

Columns: export_permit_number, container_id, destination_port, destination_country, shipment_date, batch_numbers, total_weight_kg, flegt_status.

The batch_numbers field contains comma-separated sawmill batch numbers -- SB-2024-001, SB-2024-003. This connection is clean. The customs system references sawmill batches directly with matching identifiers.

So the chain of custody path is: forestry log tags -> (mapping logbook) -> sawmill batch numbers -> (direct reference) -> export shipments. The first link is messy. The second is clean. Everything depends on resolving that first link correctly.
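The two-hop walk can be sketched end to end. The rows below are hypothetical but use the field names and formats from the samples (`forestry_log_tag`, `sawmill_batch_number`, `batch_numbers`); the export permit number is invented for illustration:

```python
# Hypothetical digitized logbook: forestry tag -> sawmill batch.
mapping = [
    {"forestry_log_tag": "0247", "sawmill_batch_number": "SB-2024-001"},
    {"forestry_log_tag": "0248", "sawmill_batch_number": "SB-2024-001"},
]
# Hypothetical export row: comma-separated sawmill batches per shipment.
customs = [
    {"export_permit_number": "EP-001", "batch_numbers": "SB-2024-001, SB-2024-003"},
]

# Hop 2 is a direct reference: split batch_numbers, index batches to shipments.
batch_to_shipments = {}
for row in customs:
    for batch in (b.strip() for b in row["batch_numbers"].split(",")):
        batch_to_shipments.setdefault(batch, []).append(row["export_permit_number"])

# Walk the chain for each logbook entry: tag -> batch -> shipment(s).
for entry in mapping:
    shipments = batch_to_shipments.get(entry["sawmill_batch_number"], [])
    print(entry["forestry_log_tag"], "->", entry["sawmill_batch_number"], "->", shipments)
```

Hop 2 needs only a split and a strip; hop 1 is where the normalization work from Step 6 has to happen before the lookup, or tags silently fall out of the chain.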


✓ Check

The student can answer: What three data sources need to be connected? What links forestry to sawmill -- and what's the problem with that link? Which downstream connection is clean?