Schema Documentation — Mel do Sofala Pipeline

Why layers exist

A data warehouse organizes data into layers because raw data and consumer-ready data serve different purposes. If analysts query raw data directly, every analyst writes their own cleaning logic — their own date parsing, their own deduplication, their own revenue calculation. They will not agree. Different spreadsheets of "total revenue" circulate, and nobody knows which one is right.

Layers solve this. The staging layer conforms raw data: correct types, consistent names, standardized units. It does not interpret the data or apply business logic — it makes the data trustworthy. The mart layer applies business logic: deduplication, computed columns, aggregations. It serves the consumer — Carlos, the buyers, anyone who needs answers.

The staging layer preserves source fidelity. The mart layer serves business questions. Skipping staging means business logic operates on unverified data.

Naming conventions

All tables and columns use snake_case.

  • stg_ prefix — staging tables. Cleaned and conformed, no business logic. Example: stg_harvests.
  • fct_ prefix — fact tables (mart layer). Business events with measures. Example: fct_harvests.
  • dim_ prefix — dimension tables (mart layer). Descriptive attributes. Example: dim_beekeepers. Not used in this project, but mentioned for context — you will encounter dimension tables in later projects.

Column names describe what they contain: weight_kg (not weight or w), collection_date (not date or d), price_per_kg (not price).

Staging table: stg_harvests

One row per source record. No deduplication — if the source has duplicates, staging preserves them. This is intentional: staging reflects what the source contained, not what it should have contained.

| Column | Type | Description |
| --- | --- | --- |
| beekeeper_name | VARCHAR | Name of the beekeeper who delivered the honey |
| collection_date | DATE | Date of the harvest delivery |
| weight_kg | DECIMAL(10,2) | Weight in kilograms (standardized from source — pound values converted) |
| quality_grade | VARCHAR | Quality grade assigned at collection: A, B+, B, or C |
| price_per_kg | DECIMAL(10,2) | Price paid per kilogram in Mozambican Metical (MZN) |
| collection_point | VARCHAR | Name of the collection point (derived from source filename) |
| source_file | VARCHAR | Original CSV filename — for traceability back to Carlos's spreadsheets |
| loaded_at | TIMESTAMP | When the record was loaded into the database |

Notes:

  • weight_kg must be in kilograms regardless of the source unit. Two collection points (Buzi and Marromeu) record weights in pounds — convert using 1 lb = 0.453592 kg.
  • collection_point is derived from the filename, not from a column in the CSV. The filename encodes the collection point and quarter.
  • quality_grade accepts A, B+, B, or C. Older files contain only A, B, C (the B+ grade was introduced about 8 months ago).
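The notes above can be sketched as a small conforming step. This is a minimal illustration, not the pipeline's actual code: the function name is hypothetical, and it assumes a filename layout like `buzi_2024_q1.csv` where the token before the first underscore names the collection point — the real encoding may differ.

```python
from datetime import datetime, timezone
from pathlib import Path

LB_TO_KG = 0.453592
POUND_POINTS = {"buzi", "marromeu"}  # points that record weight in pounds

def conform_row(raw_weight: float, source_file: str) -> dict:
    """Conform one raw record for stg_harvests (sketch only).

    Assumes the collection point is the token before the first underscore
    in the filename -- an assumption, not the documented encoding.
    """
    point = Path(source_file).stem.split("_")[0].lower()
    # Standardize to kilograms; only the pound-recording points convert.
    weight_kg = raw_weight * LB_TO_KG if point in POUND_POINTS else raw_weight
    return {
        "collection_point": point,
        "weight_kg": round(weight_kg, 2),        # matches DECIMAL(10,2)
        "source_file": source_file,              # traceability requirement
        "loaded_at": datetime.now(timezone.utc),
    }
```

Note that the conversion happens here, in staging — downstream layers can then assume weight_kg is always kilograms.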

Mart table: fct_harvests

One row per unique delivery — deduplicated on the composite key (beekeeper_name, collection_date, collection_point). Exact duplicate rows (same beekeeper, same date, same collection point, same values) are data entry errors and collapse to one record.

| Column | Type | Description |
| --- | --- | --- |
| beekeeper_name | VARCHAR | Name of the beekeeper |
| collection_date | DATE | Date of the harvest delivery |
| weight_kg | DECIMAL(10,2) | Weight in kilograms |
| quality_grade | VARCHAR | Quality grade: A, B+, B, or C |
| price_per_kg | DECIMAL(10,2) | Price paid per kilogram (MZN) |
| collection_point | VARCHAR | Collection point name |
| total_value | DECIMAL(12,2) | Computed: weight_kg * price_per_kg |
| source_file | VARCHAR | Original CSV filename |
| loaded_at | TIMESTAMP | When the record was loaded |

Deduplication key: (beekeeper_name, collection_date, collection_point). This is a composite key — all three columns together identify a unique delivery. A single beekeeper can deliver on multiple dates (common — most active beekeepers deliver 2-5 times per quarter). The same beekeeper can also deliver to different collection points. Only when all three match is it a duplicate.

Why not deduplicate on beekeeper name alone? Because that would collapse all of a beekeeper's deliveries into one record. A beekeeper who delivered three times in a quarter would appear only once. The row count would drop dramatically, and the revenue totals would be wrong.
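The composite-key deduplication and the total_value computation can be sketched together. This is an illustrative sketch in Python (a SQL build would typically use SELECT DISTINCT or ROW_NUMBER instead); the function name and sample data are hypothetical.

```python
def build_fct_harvests(stg_rows: list[dict]) -> list[dict]:
    """Build mart rows from staging rows (sketch only).

    Collapses rows that share the composite key
    (beekeeper_name, collection_date, collection_point) -- exact
    duplicates from data entry -- keeping the first occurrence,
    and adds the computed total_value column.
    """
    seen: set[tuple] = set()
    fct = []
    for row in stg_rows:
        key = (row["beekeeper_name"], row["collection_date"],
               row["collection_point"])
        if key in seen:
            continue  # data entry duplicate: collapse to one record
        seen.add(key)
        fct.append({**row,
                    "total_value": round(row["weight_kg"] * row["price_per_kg"], 2)})
    return fct
```

A second delivery by the same beekeeper on a different date produces a different key, so it survives deduplication — which is exactly why the key must include all three columns.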

Design notes

Why deduplication happens at the mart layer, not staging: Staging preserves source fidelity. If a row appears twice in the source CSV, it appears twice in staging. The mart layer is where we decide what counts as a "real" delivery versus a data entry error. This separation means you can always go back to staging to see exactly what the source contained.

Why source_file is tracked: Carlos's original spreadsheets are the source of truth. If a buyer asks "where did this batch come from?" the answer traces back through source_file to the original collection point spreadsheet. Traceability is a business requirement, not a technical convenience.