Step 1: Product name mapping
Three systems, three naming conventions, one product catalog. The first job is to map every product name in every source to its canonical name in the product catalog.
Direct AI to build the mapping:
Using the product_catalog.csv as the authoritative list, create a mapping table that matches every distinct product name in taproom_pos.csv, distribution_invoices.csv, and online_orders.csv to its canonical product name. Show me the mapping table when done.
Review every row. AI will get most of them right -- "IPA" maps to "IPA" in the catalog, "BL IPA x24" maps to "IPA", "Bona Lluna IPA 330ml" maps to "IPA". But check the edge cases. Does "Bona Lluna Pale Ale 6-Pack" map to "Pale Ale"? Does AI handle the discontinued products correctly? "Red Ale" and "Pilsner" appear in historical data but are marked discontinued in the catalog.
If AI makes errors in the fuzzy matching, correct them specifically. "Map 'Bona Lluna Stout Gift Box' to 'Stout' -- it's a gift set of the same beer."
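Once you have reviewed and corrected the mapping, it is worth keeping it as an explicit lookup table rather than re-running fuzzy matching. A minimal sketch in pandas, with hypothetical sample names (the real table would hold every distinct name from all three sources):

```python
import pandas as pd

# Hypothetical sample of distinct product names pulled from the sources.
source_names = pd.DataFrame({
    "source": ["taproom_pos", "distribution_invoices",
               "online_orders", "online_orders"],
    "raw_name": ["IPA", "BL IPA x24",
                 "Bona Lluna IPA 330ml", "Bona Lluna Stout Gift Box"],
})

# The reviewed mapping: every raw name -> canonical catalog name.
# Corrections (like the gift box) are ordinary entries in this table.
name_map = {
    "IPA": "IPA",
    "BL IPA x24": "IPA",
    "Bona Lluna IPA 330ml": "IPA",
    "Bona Lluna Stout Gift Box": "Stout",
}

source_names["canonical"] = source_names["raw_name"].map(name_map)

# Any unmapped name surfaces as NaN -- fail loudly, don't guess silently.
unmapped = source_names[source_names["canonical"].isna()]
assert unmapped.empty, f"Unmapped product names:\n{unmapped}"
```

The advantage of the explicit dictionary is that your corrections survive re-runs: fix "Bona Lluna Stout Gift Box" once and it stays fixed.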
Step 2: Date standardisation
Each source uses a different date format. The taproom uses YYYY-MM-DD. Distribution uses DD-Mon-YYYY (like "15-Jan-2024"). Online uses MM/DD/YYYY.
Direct AI to standardise all dates to YYYY-MM-DD format:
Standardise the date columns across all three sources to YYYY-MM-DD format. For the distribution data (DD-Mon-YYYY) and online data (MM/DD/YYYY), parse and convert. Verify the conversion by showing a few sample dates from each source before and after.
AI commonly makes assumptions about date locale. "03/04/2024" in MM/DD/YYYY is March 4th, not April 3rd. The online data uses US format because WooCommerce defaults to it. Ask AI to show you a known date from the distribution data -- pick one you can verify -- to confirm the parse is correct.
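The safest way to avoid locale guessing is to pass an explicit format string per source instead of letting the parser infer one. A sketch, assuming hypothetical sample dates and that each source's date column has already been isolated as a Series:

```python
import pandas as pd

# Hypothetical sample dates in each source's format.
taproom = pd.Series(["2024-01-15", "2024-03-04"])
distribution = pd.Series(["15-Jan-2024", "04-Mar-2024"])
online = pd.Series(["03/04/2024", "12/31/2024"])  # US-style MM/DD/YYYY

# Explicit format strings: never let the parser guess the locale.
taproom_std = pd.to_datetime(taproom, format="%Y-%m-%d")
distribution_std = pd.to_datetime(distribution, format="%d-%b-%Y")
online_std = pd.to_datetime(online, format="%m/%d/%Y")

# "03/04/2024" must come out as March 4th, not April 3rd.
assert online_std[0].month == 3 and online_std[0].day == 4
```

With an explicit `format`, an unexpected value raises an error instead of being silently reinterpreted, which is exactly the behaviour you want during cleaning.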
Step 3: Distribution data -- consignment returns
The distribution data has a payment_type column with values "standard" and "consignment". This is one of Carmen's hidden details: standard wholesale means restaurants pay per keg on delivery. Consignment means bars pay after selling and return unsold kegs.
Look at the payment_status column for consignment invoices. Some are "paid", some "pending", and some "returned". Returned consignment kegs are kegs that didn't sell -- they came back. If you count them as revenue, you're overstating what Carmen actually earned.
Decide how to handle this. The options:
- Subtract returned consignment amounts from total distribution revenue
- Exclude all consignment invoices (too aggressive -- most consignment kegs do sell)
- Keep returned consignment as a separate line item for transparency
Document whichever approach you choose in the analysis template's cleaning decisions log. Note how many rows are affected and what the revenue impact is.
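Whichever option you pick, quantify it before logging it. A sketch of the first option (subtracting returned consignment revenue), using a hypothetical slice of the invoices and an assumed amount_eur column name:

```python
import pandas as pd

# Hypothetical slice of distribution_invoices after name/date cleaning.
# Column name amount_eur is an assumption for illustration.
inv = pd.DataFrame({
    "payment_type":   ["standard", "consignment", "consignment", "consignment"],
    "payment_status": ["paid",     "paid",        "pending",     "returned"],
    "amount_eur":     [480.0,      320.0,         320.0,         160.0],
})

returned = inv[(inv["payment_type"] == "consignment")
               & (inv["payment_status"] == "returned")]

gross = inv["amount_eur"].sum()
net = gross - returned["amount_eur"].sum()

# These numbers go straight into the cleaning decisions log.
print(f"{len(returned)} returned consignment rows, "
      f"EUR {returned['amount_eur'].sum():.2f} subtracted: "
      f"gross {gross:.2f} -> net {net:.2f}")
```

The printed line is the documentation: row count affected and revenue impact, exactly what the log entry needs.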
Step 4: Online data -- currency handling
The online shop processes orders in EUR and GBP. UK customers pay in GBP, but the unit_price column contains amounts in both currencies without any conversion.
Direct AI to show you the currency split:
How many online orders are in EUR versus GBP? What's the total revenue in each currency?
You need to decide: convert GBP to EUR at a fixed rate (which rate?), or handle them separately, or exclude GBP orders. None of these is neutral. A fixed conversion rate ignores exchange rate fluctuations over the twelve months. Separate handling makes the dashboard more complex. Excluding GBP loses about 8% of online revenue.
Document the decision and the trade-off in the analysis template.
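Here is a sketch of the split plus a fixed-rate conversion, assuming hypothetical order rows, an assumed currency column, and an illustrative rate (not a real quote -- the rate you choose and its date belong in the log):

```python
import pandas as pd

# Hypothetical slice of online_orders; currency column is an assumption.
orders = pd.DataFrame({
    "currency":   ["EUR", "EUR", "GBP", "GBP"],
    "unit_price": [4.50,  6.00,  4.20,  5.10],
    "quantity":   [6,     1,     2,     1],
})
GBP_TO_EUR = 1.17  # illustrative fixed rate -- document rate and date

orders["revenue"] = orders["unit_price"] * orders["quantity"]

# The currency split the prompt asks for: order count and revenue per currency.
split = orders.groupby("currency")["revenue"].agg(["count", "sum"])

# One option: convert GBP rows to EUR at the fixed rate, keep the original.
orders["revenue_eur"] = orders["revenue"].where(
    orders["currency"] == "EUR", orders["revenue"] * GBP_TO_EUR)
```

Keeping both `revenue` and `revenue_eur` columns preserves the original amounts, so the conversion decision stays reversible and auditable.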
Step 5: Join and validate
Now join the three cleaned sources into a unified dataset. This is where cross-source profiling matters.
Join the three cleaned and standardised data sources into a unified dataset. After the join, show me: total row count, total rows from each source, and total revenue from each source individually versus the unified total.
Check the row count. The unified dataset should have approximately the sum of the three source row counts -- taproom + distribution + online. If it has more, there's a duplication problem from the join. If it has significantly fewer, rows were dropped.
Then check the revenue. Add up the individual source revenues. Compare that to the total revenue from the unified dataset. If they don't match, the difference should be explainable: consignment returns you subtracted, currency conversion adjustments, or rows you excluded.
AI often generates joins without checking whether the row count changed, so this is the first thing to verify yourself. A mismatch between "sum of parts" and "unified total" is information about your definitional choices, not a bug.
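The two checks above can be written as assertions that run every time the pipeline does. A sketch, assuming hypothetical cleaned sources already reduced to a shared schema with a revenue_eur column:

```python
import pandas as pd

# Hypothetical cleaned sources, already reduced to a shared schema.
taproom = pd.DataFrame({"source": "taproom", "revenue_eur": [120.0, 80.0]})
distribution = pd.DataFrame({"source": "distribution",
                             "revenue_eur": [480.0, 320.0]})
online = pd.DataFrame({"source": "online", "revenue_eur": [48.5]})

unified = pd.concat([taproom, distribution, online], ignore_index=True)

# Row-count check: a stacked union should equal the sum of its parts.
parts_rows = len(taproom) + len(distribution) + len(online)
assert len(unified) == parts_rows, "join duplicated or dropped rows"

# Revenue check: per-source sums must add up to the unified total.
per_source = unified.groupby("source")["revenue_eur"].sum()
assert abs(per_source.sum() - unified["revenue_eur"].sum()) < 0.01
```

Because this union stacks rows rather than matching keys, `concat` is the right tool; a keyed merge would be where duplication could silently creep in.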
Check: What is the total revenue from each source individually? Does the sum of the three match the total from your unified dataset? If not, can you explain the difference?