Learn by Directing AI
Unit 1

The brief and the data

Step 1: Farid's email

Farid Hassan runs Daun & Co, an organic tea company in Sabah, Malaysia. He grows specialty teas in Ranau and sources from certified farms in Cameron Highlands. Three sales channels: a retail shop in Kota Kinabalu, wholesale to cafes and hotels, and an online store shipping across Southeast Asia.

Revenue has been flat for two years. He suspects wholesale margins are thin and online growth is masking a retail decline. He has data from three systems -- POS, Shopify, and a manually maintained spreadsheet -- but the product names are inconsistent across all three. He tried to reconcile them himself and gave up.

Open the chat with Farid. Read his email. Five requirements:

  1. Combine the three data sources into one clean dataset
  2. Identify which products are profitable by channel
  3. Show channel trends over the fourteen months
  4. Recommend products to consider discontinuing
  5. Document every data quality issue found

Farid is methodical and precise. He responds during business hours in structured paragraphs. He expects the same precision in analysis that he applies to tea processing.

Step 2: Project setup

Open a terminal and start Claude Code:

cd ~/dev
claude

Paste this prompt:

Create the folder ~/dev/analytics/p2. Download the project materials from https://learnbydirectingai.dev/materials/analytics/p2/materials.zip and extract them into that folder. Read CLAUDE.md -- it's the project governance file.

Claude creates the folder, downloads the materials, and reads CLAUDE.md. That file describes the client, the deliverables, the tech stack, and the work breakdown. It is the project governance file -- every metric, every file reference, every verification step points back to it.

Once Claude confirms it has read CLAUDE.md, you are set up.

Step 3: The product catalog

Open materials/product-catalog.csv. This is Farid's authoritative product list -- standard names, categories, weights, and cost-per-unit in MYR.

Nineteen products across three categories: loose leaf, blended, and gift sets. The names in this file are canonical. "Sabah Gold Loose 100g" is the official name. The POS system might call it "Sabah Gold (100g)". Shopify might call it "Sabah Gold - 100g Pack". The wholesale spreadsheet might call it "Sabah Gold 100g".

Notice the cost_per_unit_myr column. These costs will matter when you define profitability. Also notice the status column -- a few products are discontinued but still appear in historical sales data.

Step 4: Profile the three sources

Direct AI to load all three data files into DuckDB and profile each one separately. Don't ask for one big analysis -- break it into focused requests.

Start with: "Load materials/retail-sales.csv into DuckDB and profile it -- row count, column names and types, value ranges, null counts, and distinct values for key columns."

Then do the same for materials/online-orders.csv and materials/wholesale-invoices.xlsx.

For each source, check: how many rows? What columns exist? What date range does it cover? How many distinct product names appear? Where are the nulls?

The sequence matters. Profiling each source separately lets you compare them afterward. Asking AI to "load and profile all three files" produces a single summary where source-specific details get lost.

Step 5: Compare and identify inconsistencies

Now you have three profiles. Compare them.

The product names are different across systems. The date formats are different -- retail uses YYYY-MM-DD, online uses MM/DD/YYYY, wholesale uses DD-MMM-YYYY. The wholesale data has an "items" column with free text like "Sabah Gold x20, Rainforest Blend x10" instead of structured product and quantity fields.

Count the distinct product names in each source and compare against the product catalog. The retail data will have more distinct names than the catalog -- because the same product appears under multiple variants. That mismatch is the first signal that reconciliation is needed before any analysis.

Some questions to check: Does every source cover the same fourteen-month period? Are there products in the sales data that don't appear in the catalog at all? How many null values are in each source and in which columns?

✓ Check

Check: How many rows are in your retail_sales table? How many in online_orders? How many distinct product names appear in retail_sales versus the product catalog -- do the counts match?