Incremental Extraction Patterns

Full vs incremental refresh

Every data pipeline faces a design choice at the extraction layer: load everything from scratch each time (full refresh), or load only what changed since the last run (incremental).

Full refresh is simpler. It is always correct because there is nothing to track -- you query the source, get all the data, and replace what you had before. The cost is proportional to the total data volume, regardless of how much actually changed.

Incremental refresh is efficient. It loads only new and changed records, which is dramatically cheaper for large or frequently-updating sources. The cost is proportional to the change volume, not the total volume. But it introduces a new class of failure: missed records, duplicate records, stale watermarks, and boundary conditions that don't exist in full refresh.

The choice is not about preference. It is about the source's characteristics: how much data, how often it changes, whether timestamps are reliable, and how corrections are handled.

Watermark columns

A watermark column is the field that tells the pipeline "what's new since last time." Common choices: updated_at, created_at, modified_date, processing_date.

The critical question is trust. Does the source system reliably update this column on every change? A source that modifies a record without updating the timestamp produces records that incremental extraction will never see -- they changed, but the watermark didn't move, so the pipeline doesn't know to pick them up.

Before choosing a watermark column, verify: does this column advance when records are created? Does it advance when records are updated? Are there any operations that change records without touching this column?
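The watermark check itself is a simple filter plus a bookkeeping step. A minimal sketch, assuming a SQLite source with a reliable updated_at column -- the table name, column names, and timestamp format here are illustrative, not from any real system:

```python
import sqlite3

def extract_incremental(conn, last_watermark):
    """Return rows changed since last_watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, value, updated_at FROM source_table "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    # Advance the watermark only if we actually saw new rows;
    # otherwise the next run re-reads from the same point.
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_table (id INTEGER, value TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO source_table VALUES (?, ?, ?)",
    [(1, "a", "2024-01-01"), (2, "b", "2024-01-02"), (3, "c", "2024-01-03")],
)

# Only rows strictly after the stored watermark come back.
rows, wm = extract_incremental(conn, "2024-01-01")
```

Note the boundary condition: a strict > excludes rows that share the exact watermark value of the last run, while >= re-reads them. Either choice is workable only if the downstream load is idempotent.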

The MERGE pattern

In a full refresh pipeline, you can DROP and recreate the target table each time. In an incremental pipeline, you need to add new records and update existing ones without creating duplicates.

MERGE (also called upsert) does both in one operation. Given a natural key that uniquely identifies each record:

  • If the key exists in the target: update the record with new values
  • If the key doesn't exist: insert it as a new record

The result is idempotent -- running the same MERGE twice with the same data produces the same table state. This is the property that makes incremental extraction safe to retry.
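The idempotence property can be demonstrated directly. This is a sketch using SQLite's INSERT ... ON CONFLICT upsert syntax (its equivalent of MERGE, available since SQLite 3.24); the table and key names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, value TEXT)")

def merge(conn, records):
    # If the key exists, update it; if not, insert it.
    conn.executemany(
        "INSERT INTO target (id, value) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET value = excluded.value",
        records,
    )

batch = [(1, "a"), (2, "b")]
merge(conn, batch)
merge(conn, batch)  # retry with the same data -- no duplicates

count = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
# count is 2: the second run updated in place instead of inserting again
```

Swapping the body of merge for a plain INSERT and re-running it would leave four rows instead of two -- the failure mode described below.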

The alternative -- plain INSERT -- is not idempotent. Running INSERT twice with the same data doubles every record. The first run looks fine. The retry silently duplicates everything: no error, just wrong numbers.

The natural key selection is a design decision. What uniquely identifies a paddy intake record? The farmer name alone is not enough (the same farmer delivers multiple times). The date alone is not enough (multiple farmers deliver on the same day). The combination of mill + farmer + date + grade might work -- but only if a farmer never delivers the same grade twice in one day. The grain decision determines which records merge and which stay separate.
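A sketch of that composite key in practice, under the stated assumption that (mill, farmer, delivery_date, grade) is unique per day -- all names and values here are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE paddy_intake ("
    "  mill TEXT, farmer TEXT, delivery_date TEXT, grade TEXT, weight_kg REAL,"
    "  PRIMARY KEY (mill, farmer, delivery_date, grade))"
)

def merge_intake(conn, records):
    # Records matching on the full composite key merge; all others insert.
    conn.executemany(
        "INSERT INTO paddy_intake VALUES (?, ?, ?, ?, ?) "
        "ON CONFLICT(mill, farmer, delivery_date, grade) "
        "DO UPDATE SET weight_kg = excluded.weight_kg",
        records,
    )

merge_intake(conn, [("Mill A", "Farmer 1", "2024-06-01", "A", 500.0)])
# A corrected weight for the same delivery merges instead of duplicating:
merge_intake(conn, [("Mill A", "Farmer 1", "2024-06-01", "A", 520.0)])

final = conn.execute("SELECT COUNT(*), MAX(weight_kg) FROM paddy_intake").fetchone()
```

If the same farmer did deliver the same grade twice in one day, this key would silently merge two real deliveries into one -- which is why the grain decision has to be verified against the source, not assumed.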

When to choose which

| Factor | Favors full refresh | Favors incremental |
| --- | --- | --- |
| Data volume | Small (< 10K rows) | Large (> 100K rows) |
| Source timestamp reliability | Unreliable or absent | Reliable and consistent |
| Correction patterns | Corrections overwrite originals | Corrections are additive |
| Cost sensitivity | Low (local, cheap storage) | High (cloud, per-query pricing) |
| Freshness requirements | Daily or less frequent | Hourly or more frequent |
| Complexity tolerance | Low -- prefer simplicity | Can handle watermark monitoring |

Some sources warrant full refresh indefinitely. Small reference tables, lookup values, and sources with unreliable timestamps are safer and cheaper to reload completely. Others must be incremental from the start because the data volume makes full refresh prohibitively expensive or slow.

This is a per-source decision, not a per-pipeline decision. A single pipeline can have some sources on full refresh and others on incremental.