Step 1: Product name mapping
The three data sources call the same products by different names. Before you can combine them, you need a mapping from every variant name to the canonical name in the product catalog.
Direct AI to create a mapping: "Compare the distinct product names in retail_sales, online_orders, and wholesale_invoices against the canonical names in product-catalog.csv. Create a mapping table showing each source name, which source it comes from, and the canonical name it maps to."
Review what AI produces. Most matches will be obvious -- "Sabah Gold (100g)" maps to "Sabah Gold Loose 100g". Some will be ambiguous. "SG Loose 100" could be "Sabah Gold Loose 100g" or "Sabah Gold Loose 50g" depending on context. AI will guess. Your job is to check the guess.
Look at the quantities and prices alongside the names. A variant selling at MYR 30-35 per unit is probably the 100g version, not the 50g. Use the catalog's cost_per_unit_myr as a cross-check. If AI mapped a name to the wrong product, fix it.
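One way to sketch this mapping step is with stdlib fuzzy matching plus a price-based tiebreak. Everything below is illustrative: the catalog names, costs, and source names are made-up stand-ins, and `difflib` is just one matching approach -- you would still review every match by hand.

```python
import difflib

# Hypothetical canonical catalog names and unit costs (MYR); in practice
# these come from product-catalog.csv's name and cost_per_unit_myr columns.
catalog = {
    "Sabah Gold Loose 100g": 32.00,
    "Sabah Gold Loose 50g": 17.00,
    "Rainforest Blend 250g": 45.00,
}

# Distinct source names to map, tagged with the source they came from.
source_names = [
    ("retail_sales", "Sabah Gold (100g)"),
    ("online_orders", "SG Loose 100"),
    ("wholesale_invoices", "Rainforest Blend"),
]

def map_name(raw):
    """Return the closest canonical name, or None if nothing is close."""
    matches = difflib.get_close_matches(raw, catalog.keys(), n=1, cutoff=0.4)
    return matches[0] if matches else None

def nearest_by_price(observed_price, candidates):
    """Disambiguate between candidate names by distance to catalog cost."""
    return min(candidates, key=lambda name: abs(catalog[name] - observed_price))

mapping = [(src, raw, map_name(raw)) for src, raw in source_names]
for src, raw, canonical in mapping:
    print(f"{src}: {raw!r} -> {canonical!r}")
```

For an ambiguous name like "SG Loose 100", `nearest_by_price(32.50, ["Sabah Gold Loose 100g", "Sabah Gold Loose 50g"])` resolves to the 100g variant, mirroring the MYR 30-35 reasoning above. A real cross-check would compare margins, not raw cost, but the shape of the check is the same.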
Step 2: Parse the wholesale data
The wholesale "items" column contains free text: "Sabah Gold x20, Rainforest Blend x10". Each invoice bundles multiple products into a single text field. You need to extract individual product lines with quantities.
Direct AI to parse the items column: "For each wholesale invoice, parse the items text into individual rows -- one row per product. Extract the product name and quantity from each entry. Then map those product names to the canonical catalog names using the mapping from Step 1."
Check the parsing output. AI commonly handles the "Product x20" format well but may struggle with inconsistencies -- "Highland Res (250g) x5" versus "Highland Res 250g x 5" versus "Highland Reserve 250g x5". Review a sample of 10-15 parsed invoices to confirm the extraction is correct.
The wholesale data also has 3-5 invoices with a null items field -- summary invoices carrying only a total amount, with no line-item breakdown. You need to decide what to do with those. Dropping them loses real revenue. Keeping them means you have invoice totals you can't attribute to specific products. Document whichever choice you make and why.
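A minimal parsing sketch, assuming the "Product x20" convention described above. The invoice rows are hypothetical, and the regex only claims to handle the variants listed in this step -- anything it can't parse is flagged rather than silently dropped, and null-items invoices are set aside for the keep-or-drop decision.

```python
import re

# Hypothetical raw invoice rows: (invoice_id, items_text); None marks a
# summary invoice with no line-item breakdown.
invoices = [
    ("INV-001", "Sabah Gold x20, Rainforest Blend x10"),
    ("INV-002", "Highland Res (250g) x5"),
    ("INV-003", None),
]

# Matches "Product x20" and "Product x 20": the name is everything before
# the final "x<qty>" token at the end of the entry.
ITEM_RE = re.compile(r"^(?P<name>.+?)\s*x\s*(?P<qty>\d+)$")

def parse_items(text):
    """Split an items field into (product_name, quantity) pairs."""
    rows = []
    for entry in text.split(","):
        m = ITEM_RE.match(entry.strip())
        if m:
            rows.append((m.group("name"), int(m.group("qty"))))
        else:
            rows.append((entry.strip(), None))  # flag unparseable entries
    return rows

parsed, unattributed = [], []
for invoice_id, items in invoices:
    if items is None:
        unattributed.append(invoice_id)  # decide: drop, or keep unattributed
        continue
    for name, qty in parse_items(items):
        parsed.append((invoice_id, name, qty))
```

The extracted names would then go through the Step 1 mapping; any `None` quantities or entries left in `unattributed` are exactly the rows to eyeball in your 10-15 invoice sample.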
Step 3: Handle nulls and cleaning decisions
Each source has scattered nulls. The retail data has nulls in the quantity and unit_price columns. The online data has nulls in customer_region. Your profiles from Unit 1 already told you where the nulls are and how many.
Every cleaning decision is a trade-off. Dropping rows with null quantities reduces your dataset. Imputing with the median flattens variation. Keeping nulls forces every downstream aggregation to handle them. There is no neutral choice -- AI will pick one without telling you unless you ask.
Direct AI to handle each source's nulls, one at a time. For each decision: "How many rows have nulls in [column]? What are the possible approaches? Implement [chosen approach] and document the decision."
Watch what AI actually does. AI cleaning code can silently drop rows -- errors='coerce' followed by dropna() removes rows without reporting how many were lost. Check row counts before and after each cleaning step.
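The row-count check above can be made mechanical. A sketch with made-up retail rows, showing exactly the `errors='coerce'` + `dropna()` pattern that loses rows silently unless you count before and after:

```python
import pandas as pd

# Hypothetical retail rows: one null quantity, one non-numeric quantity,
# and one null unit_price that should survive this particular step.
retail = pd.DataFrame({
    "quantity": ["3", "two", None, "5"],
    "unit_price": [32.0, 17.0, 45.0, None],
})

before = len(retail)

# errors="coerce" turns unparseable values into NaN; dropna() then removes
# those rows. Without the before/after counts, the loss is invisible.
retail["quantity"] = pd.to_numeric(retail["quantity"], errors="coerce")
cleaned = retail.dropna(subset=["quantity"])

after = len(cleaned)
print(f"Dropped {before - after} of {before} rows while cleaning quantity")
```

Wrapping every cleaning step in this before/after pair is cheap, and it turns "AI quietly removed data" into a number you can challenge.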
Not every "duplicate" is an error. A customer buying the same product twice on the same day is a real event, not a data quality issue. If AI suggests deduplicating, ask: "What defines a true duplicate in this dataset?"
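Before accepting any deduplication, you can surface the candidate duplicates for inspection rather than dropping them. A sketch with hypothetical orders, where the repeated row is a genuine repeat purchase:

```python
import pandas as pd

# Hypothetical orders: the repeated 2024-03-15 row is a real second
# purchase of the same product, not a data-entry error.
orders = pd.DataFrame({
    "date": ["2024-03-15", "2024-03-15", "2024-03-16"],
    "product": ["Sabah Gold Loose 100g"] * 3,
    "quantity": [2, 2, 1],
})

# keep=False marks every copy of a duplicated row, so you can inspect
# them all before deciding whether drop_duplicates() is justified.
dupes = orders[orders.duplicated(keep=False)]
print(f"{len(dupes)} rows share all column values:")
print(dupes)
```

If the source had a transaction ID, two rows sharing an ID would be a true duplicate; two rows differing only by a missing ID are exactly the ambiguity the "what defines a true duplicate" question is meant to force.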
Step 4: Unify the datasets
With names mapped, wholesale parsed, and nulls handled, combine everything into a single dataset.
Direct AI to create a unified table with a common schema: date, product (canonical name), channel, quantity, unit_price, total_amount, cost_per_unit (from catalog).
The three sources have different date formats -- YYYY-MM-DD for retail, MM/DD/YYYY for online, DD-MMM-YYYY for wholesale. All need to resolve to the same format in the unified table.
The wholesale data has invoice totals but no per-unit prices. You need to compute the implied unit price from the invoice total and the quantity parsed in Step 2. This is an assumption worth documenting -- the implied price may not exactly match what Farid charged, since he mentioned tiers exist (though you may not know that yet).
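The date unification and the implied-price assumption can both be sketched briefly. The sample values are invented, and dividing the invoice total by total units is one deliberately simple allocation -- it assumes a uniform price across the invoice's line items, which is precisely the assumption to document:

```python
import pandas as pd

# Each source's format, per the descriptions above; one sample date each.
retail_dates = pd.to_datetime(pd.Series(["2024-03-15"]), format="%Y-%m-%d")
online_dates = pd.to_datetime(pd.Series(["03/15/2024"]), format="%m/%d/%Y")
wholesale_dates = pd.to_datetime(pd.Series(["15-Mar-2024"]), format="%d-%b-%Y")

# Implied wholesale unit price: invoice total divided by total units.
# ASSUMPTION: every line item on the invoice carries the same unit price.
# Tiered pricing would make this wrong per-product, even if the total checks out.
invoice_total = 850.0  # hypothetical
quantities = {"Sabah Gold Loose 100g": 20, "Rainforest Blend 250g": 10}
implied_unit_price = invoice_total / sum(quantities.values())
```

Passing an explicit `format=` per source, rather than letting `pd.to_datetime` infer it, is the safer choice here: MM/DD/YYYY and DD/MM/YYYY are indistinguishable for days 1-12, so inference can silently flip months and days.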
After unifying, check: do the row counts from each cleaned source sum to the unified total? Are there any product names in the unified table that don't appear in the product catalog?
Step 5: Self-review
This is a new verification technique. Instead of checking against a provided target number, you direct AI to review its own work against a specific reference.
Direct AI: "Review the unified dataset. Check that every product name in the unified table matches a canonical name in the product catalog. List any product names that don't match. Then check that the total row count equals the sum of retail, online, and wholesale rows after cleaning. Report any discrepancy."
The prompt matters. "Review the unified dataset" produces generic reassurance -- "the data looks clean." "Check every product name against the product catalog and list any that don't match" produces a specific, verifiable finding. The difference is whether you gave AI something concrete to check against.
This technique -- directing AI to compare its output against a specific reference document -- is the first verification act you design yourself. The provided targets from P1 still work where they exist. Self-review covers the gaps where no target was provided.
Check: Your unified dataset should contain rows from all three sources. Sum the row counts from retail, online, and wholesale after cleaning -- does the total match your unified table? Are there any product names in the unified table that don't appear in the product catalog?
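Both checks reduce to a few lines you can run yourself alongside AI's self-review. The counts and name sets below are placeholders -- substitute your actual cleaned sources and unified table:

```python
# Hypothetical cleaned row counts; substitute your own.
retail_rows, online_rows, wholesale_rows = 480, 312, 95
unified_rows = 887

# Hypothetical name sets; in practice, set(catalog["name"]) and
# set(unified["product"]).
catalog_names = {"Sabah Gold Loose 100g", "Sabah Gold Loose 50g",
                 "Rainforest Blend 250g"}
unified_names = {"Sabah Gold Loose 100g", "Rainforest Blend 250g",
                 "Highland Reserve 250g"}  # one name missing from catalog

# Check 1: row counts from the cleaned sources reconcile with the unified total.
expected = retail_rows + online_rows + wholesale_rows
if unified_rows != expected:
    print(f"Row count mismatch: {unified_rows} unified vs {expected} expected")

# Check 2: every unified product name exists in the catalog.
unmatched = unified_names - catalog_names
if unmatched:
    print(f"Names not in catalog: {sorted(unmatched)}")
```

A set difference either comes back empty or names the offenders -- the same specific, verifiable output the Step 5 prompt is designed to extract from AI, which makes it easy to compare the two.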