Step 1: Review what Assel needs
Before you design anything, go back to the pipeline spec. Assel needs to see spoilage events alongside weather conditions for the same period. She needs storage utilization by bin, grain type, and farmer. She needs this as a daily update.
Those requirements determine the schema. A weekly aggregation would average out the temperature swings that cause spoilage. A monthly view would be useless -- Assel said it herself: spoilage happens in days, not months.
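To see why weekly aggregation hides the problem, here is a toy calculation with made-up temperatures (the numbers are illustrative, not real sensor data):

```python
# Hypothetical daily max temperatures (degrees C) for one bin over a week,
# with a two-day heat spike mid-week. Values are invented for illustration.
daily_max_temp = [18, 19, 36, 35, 20, 19, 18]

weekly_avg = sum(daily_max_temp) / len(daily_max_temp)
print(f"weekly average: {weekly_avg:.1f} C")       # ~23.6 C -- looks harmless
print(f"daily maximum:  {max(daily_max_temp)} C")  # 36 C -- the spike that causes spoilage
```

The weekly average sits in a safe-looking range even though two days were hot enough to spoil grain. A daily grain keeps those days visible.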
Step 2: Ask AI for a schema proposal
Direct Claude Code to propose a schema based on the source profiles you built in Unit 2 and the requirements from the pipeline spec. Ask for: staging tables (one per data source), a fact table, and any dimension tables. For each table, ask Claude to state the grain -- what one row represents.
The grain decision is the most important design choice in this project. It determines every metric that can be calculated downstream. If the fact table has one row per bin per day, Assel can see exactly which bins experienced spoilage on which dates and what the weather was. If the grain is coarser -- weekly, or at the elevator level -- that detail is gone.
Step 3: Evaluate the proposal
Review what Claude proposed. Check specifically:
Grain: Does the proposed grain support daily bin-level analysis? Assel needs to see individual bin conditions correlated with specific weather days. If Claude proposed a more aggregated grain, push back.
Normalization: Did Claude propose separate dimension tables for grain types or quality statuses? There are three grain types (wheat, barley, flax) and three quality statuses (good, degraded, spoilage). A separate lookup table for these adds join complexity without analytical value. In an OLAP warehouse, a column with a handful of distinct values works better as a column on the fact table than as a foreign key to a dimension. AI tends toward over-normalization because its training includes more transactional database examples than analytical ones.
Source handling: Does the schema have one staging table per source? Storage and weather data have completely different structures. Mixing them into a single staging table would force them into a shared schema that fits neither.
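To make the normalization point concrete, here is a minimal sketch using Python's built-in sqlite3. The table and column names (`fct_bin_day`, `grain_type`, `quality_status`) are assumptions for illustration, not a prescribed schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Low-cardinality attributes (3 grain types, 3 quality statuses) live
# directly on the fact table -- no lookup tables, no foreign keys.
# Table and column names are illustrative.
conn.execute("""
    CREATE TABLE fct_bin_day (
        reading_date   TEXT,
        bin_id         TEXT,
        grain_type     TEXT CHECK (grain_type IN ('wheat', 'barley', 'flax')),
        quality_status TEXT CHECK (quality_status IN ('good', 'degraded', 'spoilage'))
    )
""")
conn.execute(
    "INSERT INTO fct_bin_day VALUES ('2024-07-01', 'ELEV_A-001', 'wheat', 'good')"
)

# Filtering by grain type needs no join at all.
rows = conn.execute(
    "SELECT bin_id FROM fct_bin_day WHERE grain_type = 'wheat'"
).fetchall()
print(rows)  # [('ELEV_A-001',)]
```

The CHECK constraints document the small value set in place, which is all a three-row dimension table would have bought you, minus the join.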
Step 4: Make the grain decision
Decide: what does one row represent in your main fact table?
The grain should be one row per bin per day -- a storage reading joined with that day's weather conditions. This serves every query Assel needs: per-bin spoilage analysis, weather correlation, utilization trends over time.
Write down the decision. This is the first real schema design choice you have made. In P1, the schema was provided. Now it is yours.
Step 5: Define the staging layer
You need two staging tables:
- stg_storage: Storage readings from both elevators. Standardize the bin IDs so both formats (A-001, B1-01) are consistent and include an elevator identifier. Cast types. Rename columns to a consistent convention.
- stg_weather: Weather data from the API. Convert timestamps from UTC to UTC+6 (Kazakhstan time). Cast types. Rename columns to match your naming convention.
The timezone conversion goes here, in staging, not in the mart. Staging conforms data to a shared standard. By the time data reaches the mart, storage dates and weather dates should be in the same timezone and ready to join.
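The two staging transformations can be sketched in plain Python. The target bin ID format and the function names are assumptions; your actual convention may differ:

```python
from datetime import datetime, timedelta, timezone

KZ_TZ = timezone(timedelta(hours=6))  # UTC+6, Kazakhstan time

def standardize_bin_id(raw_id: str, elevator: str) -> str:
    """Normalize both source formats (e.g. 'A-001', 'B1-01') to one convention.
    The target format '<elevator>-<zero-padded number>' is an assumption."""
    number = raw_id.split("-")[-1]
    return f"{elevator}-{int(number):03d}"

def to_local_date(utc_timestamp: str) -> str:
    """Convert a UTC API timestamp to a UTC+6 calendar date for joining."""
    utc_dt = datetime.fromisoformat(utc_timestamp).replace(tzinfo=timezone.utc)
    return utc_dt.astimezone(KZ_TZ).date().isoformat()

print(standardize_bin_id("A-001", "ELEV_A"))  # ELEV_A-001
print(standardize_bin_id("B1-01", "ELEV_B"))  # ELEV_B-001
print(to_local_date("2024-07-01T20:30:00"))   # 2024-07-02 -- late UTC evening rolls into the next local day
```

The last line shows why the conversion matters: a reading stamped late evening UTC belongs to the next calendar day in Kazakhstan, and joining on the wrong date would pair storage readings with the wrong weather.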
Step 6: Define the mart layer
The mart table joins storage readings with weather conditions on date. Each storage reading for a given day gets paired with that day's temperature, humidity, and precipitation.
This is where the grain decision becomes concrete. One row per bin per day, with both storage fields (bin ID, grain type, moisture, quality status) and weather fields (temperature min/max/average, humidity, precipitation) on the same row. Assel queries this table directly.
Check your understanding: your schema has one staging table per source, a fact table with an explicit grain (one row per bin per day), and the timezone conversion handled in staging rather than the mart.