Step 1: Load the data into DuckDB
You know what Naledi needs and what the data contains — on paper. Now look at the actual data.
In your Claude Code session, paste this prompt:
Load materials/khumalo-sales.csv into DuckDB and confirm the load: show me the row count, column names, and the first five rows.
DuckDB is a database that runs locally -- no server to start. It handles analytical queries on structured data faster than loading everything into a spreadsheet. Claude will write a Jupyter notebook cell that creates a DuckDB table from the CSV. A notebook is a document that mixes code, output, and notes in one place -- you run a cell, see the result immediately below it, and keep a record of everything you did. The confirmation you are looking for: a row count, a list of column names, and a few sample rows so you can see the data is real and structured.
Check the column names against materials/data-dictionary.md. If any name is different or missing, the load did something unexpected. The sample rows should show recognizable values — order dates in 2024, product types like dining_table and shelving, ZAR amounts in the sale_amount column.
Step 2: Profile every column
Profiling means generating summary statistics for every column -- counts, ranges, distinct values -- so you can see whether the data actually matches what the data dictionary documented. Now ask Claude to profile the full dataset:
Profile every column in the sales table: column types, null counts, distinct values for categorical columns, min/max/mean for numeric columns, and the date range.
DuckDB can summarize a table in one command. The result is a table of column-level statistics — one row per column — showing what the data actually contains.
Step 3: Read the profile against the dictionary
This is the step that matters. The data dictionary says what the columns should contain. The profile tells you what they actually contain. Read them side by side.
Check these specifics:
- Column types. Are dates stored as dates, not strings? Are numeric fields actually numeric?
- sale_amount range. The minimum should be negative — those are refunds. If the minimum is zero or positive, something went wrong with the load.
- channel distinct values. Exactly three: retail, commission, online. No misspellings, no extra categories.
- product_type distinct values. Six categories matching what Naledi described.
- Null counts. The retailer_commission_rate column should have nulls — it only applies to the retail channel.
- Date range. January through December 2024.
A profile where you understand every column is worth more than a clean notebook where the statistics ran and nobody checked what they say.
Step 4: Compute basic statistics
Direct Claude to compute a few specific numbers you can check against the verification targets in the analysis spec:
How many total rows are in the dataset? What is the date range? How many distinct channels are there, and what are they? How many refund rows (negative sale_amount)?
These are not complex queries. The point is to get concrete numbers you can compare against the targets in materials/analysis-spec.md Section 4. Open that file and check each value Claude reports.
Step 5: Verify against the targets
Compare Claude's output against the verification targets:
| What to check | Expected value |
|---|---|
| Total rows | 845 |
| Date range | 2024-01-01 to 2024-12-31 |
| Channel values | commission, online, retail |
| Refund rows | 25 |
If every number matches, the data loaded correctly and you are working with the right dataset. If something does not match, stop. Ask Claude to re-check the load. A wrong row count means missing or duplicate records. A wrong date range means the data was filtered during import. Fix it before moving on — every metric you compute in the next unit depends on this data being right.
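The comparison itself can be one small cell: the targets from materials/analysis-spec.md Section 4 on one side, the numbers Claude reported on the other. In this sketch, `observed` is a placeholder — in your notebook it would hold the values computed in Step 4.

```python
# Expected values from the verification targets in the analysis spec.
expected = {
    "total_rows": 845,
    "date_range": ("2024-01-01", "2024-12-31"),
    "channels": ["commission", "online", "retail"],
    "refund_rows": 25,
}

# Placeholder: fill these in from the numbers Claude reported in Step 4.
observed = {
    "total_rows": 845,
    "date_range": ("2024-01-01", "2024-12-31"),
    "channels": ["commission", "online", "retail"],
    "refund_rows": 25,
}

mismatches = {
    key: (expected[key], observed[key])
    for key in expected
    if expected[key] != observed[key]
}
if mismatches:
    print("STOP - re-check the load:", mismatches)
else:
    print("All verification targets match.")
```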
This is verification at its simplest: a number from AI, a number from a trusted source, and a comparison. The habit matters more than the complexity. Later projects will not hand you a target for every number. The practice of checking starts here.
Check: Your dataset should have 845 rows. The date range should span 2024-01-01 to 2024-12-31. The channel column should contain exactly three values: {retail, commission, online}.