Step 1: Open the methodology memo template
Open materials/methodology-memo-template.md. This is a document you fill in as you work -- not a pre-work exercise. It has seven sections: Data Sources, Join Strategy, Analytical Questions, Methods, Assumptions Checked, Limitations, Key Findings.
Start by filling in Section 1 (Data Sources) with what you learned from profiling. List each source, its shape, what it contains, and the quality issues you spotted.
Step 2: What DuckDB is and why it matters here
You have three data sources that need to be combined. pandas can do this, but SQL makes join logic explicit and auditable -- you can read a SQL query and see exactly what was matched, what was dropped, and why.
DuckDB is an in-process SQL engine that queries local files without a database server. You point it at a CSV file and query it as if it were a table. In Jupyter, it works through the duckdb Python package.
Direct AI to install DuckDB in your Jupyter environment and verify it can load one of the CSV files. A simple query -- select the first 5 rows from bookings -- confirms everything works.
Step 3: Design the join strategy
Before writing any join code, think about the plan. You need to connect bookings, reviews, and revenue into one dataset. Ask yourself:
- Which sources should be joined first?
- On what keys? Property name and time period are the natural candidates.
- What happens to records that exist in one source but not another?
Direct AI to propose a join plan. It will suggest the obvious approach -- join on property name and date. That is a reasonable starting point, but the details matter. Revenue is monthly while bookings and reviews are daily. Dates might need to be aggregated before the revenue join works.
Step 4: The first join -- bookings and reviews
Direct AI to write a SQL query (via DuckDB) that joins bookings and reviews on property name and date.
Before you trust the result, check the row count. How many rows did each source have? How many rows does the joined result have? If the joined dataset is much larger than either source, that is a signal -- it means the join key was not unique and you have a many-to-many join inflating the dataset with duplicate records.
AI commonly gets join logic wrong in ways that look correct. It joins on the obvious key without checking whether the key produces a clean match. Check the row counts before and after. Check for duplicates. Check for records that did not match.
Step 5: Catch the date format issue
Look closely at the joined results. Something is off with the dates.
The bookings data stores dates as DD/MM/YYYY. The reviews data stores dates as MM/DD/YYYY. For any date where the day is 13 or higher, the mismatch is detectable -- the string is invalid under the other format, so those rows fail to parse or simply find no match. But for dates between the 1st and the 12th, both formats look the same: 05/03/2025 could be March 5th or May 3rd. Those records will silently mismatch -- the join succeeds, but on the wrong day.
If you did not catch this during profiling, the row counts or a spot-check of joined records will reveal the problem. Direct AI to diagnose and fix the format discrepancy -- standardize all dates to a consistent format before joining.
This is the kind of issue that AI will not catch on its own. The join code runs without errors regardless of whether the dates are interpreted correctly. Only domain reasoning -- knowing that different systems often store dates differently -- catches it.
Step 6: The second join -- adding revenue
Now join the booking-review dataset with the monthly revenue reports.
Revenue data is organized by property and month, not by individual booking. You will need to aggregate or match at the monthly level.
Revenue has its own quality issue: each property manager uses different category names. "Room revenue" in one source is "accommodation income" in another and "lodging" in a third. One property includes breakfast revenue in room revenue while others separate it. AI will merge these columns without noticing the inconsistency.
Direct AI to resolve the naming and scope inconsistencies before completing the join. Standardize the revenue categories so comparisons across properties are fair.
Step 7: Document the join strategy
Fill in Section 2 of the methodology memo: what keys were used, what matching logic was applied, how many records matched and did not match, and how the date format and naming discrepancies were resolved.
This documentation is not overhead. Three analytical steps from now, if a result looks suspicious, this is where you trace it back. Without it, you would need to re-examine every preparation decision to figure out what went wrong.
Check: Row count plausible after each join. No date mismatches. Revenue categories consistent. Methodology memo updated.