Learn by Directing AI
Unit 3

Design the extraction strategy

Step 1: Fill the pipeline spec template

Open materials/pipeline-spec-template.md. This is the structure for your extraction design -- sources, target schema, extraction pattern per source, refresh strategy per source.

Fill it in with what you know:

  • Sources. Mill 1 daily CSV export (farmer_name, paddy_weight_kg, moisture_pct, grade, price_mmk, mill_date, intake_time). Mill 2 daily JSON export (supplier_name, weight_kg, moisture_percent, harvest_quality, payment_amount, processing_date). Field mapping from materials/field-mapping.md.
  • Target schema. Staging layer with unified field names. Two source-conform staging models (stg_mill1_daily, stg_mill2_daily) feeding one intermediate model (int_daily_operations).
  • Extraction pattern. This is the decision you are making in this unit.
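The per-source rename step in the spec can be sketched as a mapping from raw field names to unified staging names. The unified names below are assumptions for illustration; use the actual names from materials/field-mapping.md:

```python
# Hypothetical unified staging names -- confirm against materials/field-mapping.md.
# intake_time exists only in Mill 1 and is omitted here for brevity.
MILL1_MAP = {
    "farmer_name": "supplier_name",
    "paddy_weight_kg": "weight_kg",
    "moisture_pct": "moisture_pct",
    "grade": "quality_grade",
    "price_mmk": "payment_mmk",
    "mill_date": "operation_date",
}

MILL2_MAP = {
    "supplier_name": "supplier_name",
    "weight_kg": "weight_kg",
    "moisture_percent": "moisture_pct",
    "harvest_quality": "quality_grade",
    "payment_amount": "payment_mmk",
    "processing_date": "operation_date",
}

def conform(record: dict, mapping: dict) -> dict:
    """Rename raw source fields to the unified staging schema."""
    return {mapping[k]: v for k, v in record.items() if k in mapping}
```

Both maps must land on the same set of unified names -- that is what lets stg_mill1_daily and stg_mill2_daily union cleanly into int_daily_operations.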

Step 2: Evaluate full vs incremental for each source

Open materials/incremental-extraction-guide.md. This is a conceptual overview of full vs incremental refresh, watermarks, and the MERGE pattern. Read it before making extraction decisions.

Two strategies. Full refresh drops and reloads everything every run. Incremental refresh loads only records changed since the last run.

Full refresh is simpler. It is always correct -- no deduplication logic, no watermark tracking, no partial-load failures. But it reprocesses all historical data every time. For Kyaw Zin Oo's mills at 200 tonnes daily, the data starts modest but accumulates. After a year, full refresh reloads 365 days of data to pick up one day's changes.

Incremental refresh is efficient. It loads only what is new or changed. But it introduces a class of failure that full refresh avoids: late-arriving records that land behind an already-advanced watermark, clock skew between the source system and the pipeline, records updated in place without their timestamp changing. These failures are silent -- the pipeline completes successfully with wrong data.

For each source, evaluate: What is the current volume? How fast does it grow? Does the source maintain trustworthy timestamps? How expensive is a full reload? The answer can be different per source.
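The volume question is worth making concrete. A daily full refresh rescans everything loaded so far, so cumulative rows scanned grow quadratically with time, while incremental stays linear. A quick sketch, using a made-up figure of 200 intake records per day:

```python
# Illustrative cost comparison over one year of daily runs.
# ROWS_PER_DAY is an assumption, not a figure from the mills.
ROWS_PER_DAY = 200
DAYS = 365

# A full refresh on day d rescans all d days accumulated so far.
full_scanned = sum(ROWS_PER_DAY * d for d in range(1, DAYS + 1))

# An incremental run on day d loads only that day's rows.
incr_scanned = ROWS_PER_DAY * DAYS

print(full_scanned)  # ~O(days^2): 13,359,000 rows scanned in year one
print(incr_scanned)  # ~O(days): 73,000 rows scanned in year one
```

The ratio (about 183x here) keeps widening in year two. Whether that cost matters at these volumes is exactly the judgment call Step 2 asks you to make per source.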

Step 3: Design the watermark strategy

A watermark column tells the pipeline "what is new since last time." Mill 1 has mill_date. Mill 2 has processing_date.

The question is not whether the column exists. The question is whether the source system reliably updates it. If a supervisor corrects yesterday's record and the system keeps the original mill_date, the correction will not appear in an incremental load that uses mill_date as the watermark. The corrected record looks like old data.
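This failure mode is easy to simulate. In the sketch below, a correction keeps its original mill_date, so a watermark filter on mill_date silently drops it (field names follow Mill 1's export; the row values are invented):

```python
from datetime import date

# The watermark is the max mill_date seen in the last successful run.
watermark = date(2024, 1, 3)

# Day-4 export: one new intake, plus a correction to a day-2 record
# whose mill_date was NOT updated by the source system.
export = [
    {"farmer_name": "U Ba", "paddy_weight_kg": 1180, "mill_date": date(2024, 1, 4)},
    {"farmer_name": "Daw Mya", "paddy_weight_kg": 950, "mill_date": date(2024, 1, 2)},  # correction
]

# Incremental filter: load only rows past the watermark.
loaded = [r for r in export if r["mill_date"] > watermark]

# The correction is silently dropped -- the load "succeeds" with stale data.
assert len(loaded) == 1
assert loaded[0]["farmer_name"] == "U Ba"
```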

Message Kyaw Zin Oo to clarify the correction process. When Mill 2's manager finds an error, do they update the existing record or re-export the entire day's file? The answer shapes your watermark design.

Step 4: Understand MERGE vs INSERT

INSERT adds rows. Every time. If you load the same data twice, you get double the rows. No error, no warning -- just inflated numbers.

MERGE -- also called upsert -- inserts new records and updates existing ones. It uses a key to decide: does this record already exist? If yes, update it. If no, insert it. The result is the same whether MERGE runs once or ten times. That property is called idempotency, and it is the difference between a pipeline that tolerates retries and one that corrupts data on retry.

Run a concrete demonstration. Load a day's Mill 1 data into DuckDB with INSERT. Load it again. Count the rows -- they doubled. Now create a table with a MERGE pattern using a natural key. Load the same data twice. Count the rows -- they stayed the same.

```sql
SELECT COUNT(*) FROM mill1_insert_test;
SELECT COUNT(*) FROM mill1_merge_test;
```

The INSERT table has twice the expected rows. The MERGE table has the correct count.
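Here is a runnable version of the demonstration, using Python's built-in sqlite3 as a stand-in for DuckDB -- both support the INSERT ... ON CONFLICT upsert form, so the DuckDB version reads almost identically. Table names match the queries above; column values are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mill1_insert_test (farmer_name TEXT, mill_date TEXT, paddy_weight_kg REAL)")
conn.execute("""CREATE TABLE mill1_merge_test (
    farmer_name TEXT, mill_date TEXT, paddy_weight_kg REAL,
    PRIMARY KEY (farmer_name, mill_date))""")

day = [("U Ba", "2024-01-03", 1200.0), ("Daw Mya", "2024-01-03", 950.0)]

# Plain INSERT, loaded twice: rows double, no error, no warning.
for _ in range(2):
    conn.executemany("INSERT INTO mill1_insert_test VALUES (?, ?, ?)", day)

# Upsert keyed on (farmer_name, mill_date), loaded twice: the second
# load updates in place, so the count is stable -- idempotency.
for _ in range(2):
    conn.executemany("""
        INSERT INTO mill1_merge_test VALUES (?, ?, ?)
        ON CONFLICT (farmer_name, mill_date)
        DO UPDATE SET paddy_weight_kg = excluded.paddy_weight_kg""", day)

print(conn.execute("SELECT COUNT(*) FROM mill1_insert_test").fetchone()[0])  # 4
print(conn.execute("SELECT COUNT(*) FROM mill1_merge_test").fetchone()[0])   # 2
```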

Step 5: Design the natural key for MERGE

The natural key determines what "the same record" means. Get it wrong and the MERGE either combines records that should be separate or fails to combine records that are corrections of each other.

For Kyaw Zin Oo's data, consider: what uniquely identifies a single paddy intake event? Record ID alone? Mill ID plus farmer plus date? Mill ID plus farmer plus date plus grade?

Too broad a key (just mill_date) merges all records for a day into one. Too narrow a key (including every column) treats a corrected record as a new record, which defeats the purpose.

The right key matches the grain of the data -- one row per intake event per farmer per mill per day. Test it: load the original day 3 data, then load the corrected day 3 data from materials/mill1-day3-corrected.csv. If the MERGE key is correct, the corrected values replace the originals and the row count stays the same.
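The key test can be scripted the same way. The sketch below uses a key at the intake grain -- (mill_id, farmer_name, mill_date, grade) -- and shows a correction replacing the original without changing the row count. It uses sqlite3 as a stand-in for DuckDB, and the "corrected" values are invented, not read from materials/mill1-day3-corrected.csv:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE mill1 (
    mill_id TEXT, farmer_name TEXT, mill_date TEXT, grade TEXT, paddy_weight_kg REAL,
    PRIMARY KEY (mill_id, farmer_name, mill_date, grade))""")

UPSERT = """
    INSERT INTO mill1 VALUES (?, ?, ?, ?, ?)
    ON CONFLICT (mill_id, farmer_name, mill_date, grade)
    DO UPDATE SET paddy_weight_kg = excluded.paddy_weight_kg"""

# Original day-3 load.
original = [
    ("M1", "U Ba",    "2024-01-03", "A", 1200.0),
    ("M1", "Daw Mya", "2024-01-03", "B",  950.0),
]
# Corrected day-3 export: U Ba's weight was re-entered.
corrected = [
    ("M1", "U Ba",    "2024-01-03", "A", 1180.0),
    ("M1", "Daw Mya", "2024-01-03", "B",  950.0),
]

conn.executemany(UPSERT, original)
conn.executemany(UPSERT, corrected)

count = conn.execute("SELECT COUNT(*) FROM mill1").fetchone()[0]
weight = conn.execute(
    "SELECT paddy_weight_kg FROM mill1 WHERE farmer_name = 'U Ba'").fetchone()[0]
print(count, weight)  # 2 1180.0 -- count stable, correction applied
```

Had the key been mill_date alone, the second row of each load would have collided with the first; had it included paddy_weight_kg, the correction would have landed as a third row instead of replacing the original.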

✓ Check

Check: Load your Mill 1 sample data using INSERT, then load it again. Query the row count. Now load using MERGE with your chosen natural key. Query the row count. Are they different? Which is correct?