The Brief
Marco Quispe runs Cumbre Adventures in La Paz, Bolivia. Mountain biking on Death Road, trekking in the Cordillera Real, climbing Huayna Potosi, paragliding over the city. Fifteen employees, seven years in business.
Two months ago, Marco redesigned his online booking page and ran an A/B test: half the visitors saw the old page, half saw the new one. Sixty days of data. About 4,200 visitors. The new page has a higher overall booking rate. But premium trek bookings dropped. His web developer says the experiment worked. His operations manager says it's losing money. Marco is standing in the middle, not sure who's right.
He exported the raw data and needs someone to look at the numbers properly.
Your Role
You're producing an experiment report: what the A/B test actually shows, what it doesn't show, and what Marco should do about it. That means statistical tests with real numbers -- p-values, confidence intervals, effect sizes -- not just "the new page is better."
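The core comparison behind those "real numbers" is a two-proportion z-test on booking rate. A minimal sketch, using scipy.stats; the counts below are illustrative placeholders, not Marco's data, and the real values come from the CSV export:

```python
import math
from scipy.stats import norm

# Illustrative counts only -- substitute the real values from the export.
conv_old, n_old = 210, 2100   # bookings / visitors, old page
conv_new, n_new = 252, 2100   # bookings / visitors, new page

p_old, p_new = conv_old / n_old, conv_new / n_new
p_pool = (conv_old + conv_new) / (n_old + n_new)

# Two-proportion z-test: pooled standard error under H0 (equal rates).
se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_old + 1 / n_new))
z = (p_new - p_old) / se_pool
p_value = 2 * norm.sf(abs(z))          # two-sided p-value

# 95% confidence interval for the lift (unpooled standard error).
se_diff = math.sqrt(p_old * (1 - p_old) / n_old
                    + p_new * (1 - p_new) / n_new)
lift = p_new - p_old
ci = (lift - 1.96 * se_diff, lift + 1.96 * se_diff)

print(f"lift = {lift:.3f}, z = {z:.2f}, p = {p_value:.4f}, "
      f"95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

Reporting all three -- the lift (effect size), the p-value, and the interval -- is what separates an experiment report from "the new page is better."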
This time, the brief is deliberately less structured. Marco describes a situation, not a set of questions. You decide what to investigate, in what order, with what methods. The experiment data provides structure -- there are clear variants to compare -- but the analytical framing is yours.
AI computes the statistics. You verify them against provided targets, catch interpretation errors, and direct the analysis. When AI tells you "there is a 97% probability the new page is more effective," you'll need to know why that sentence is wrong.
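One way to see why that sentence is wrong: a p-value is the probability of a result at least this extreme *assuming no real effect*, not the probability that the new page is better. A quick simulation under the null makes the distinction concrete; the counts are illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative counts -- not Marco's real numbers.
n_old, n_new = 2100, 2100
conv_old, conv_new = 210, 252
observed_lift = conv_new / n_new - conv_old / n_old

# Re-run the experiment many times assuming NO real difference:
# both arms draw from the same pooled conversion rate.
p_pool = (conv_old + conv_new) / (n_old + n_new)
sims = 20_000
lifts = (rng.binomial(n_new, p_pool, sims) / n_new
         - rng.binomial(n_old, p_pool, sims) / n_old)

# Two-sided p-value: how often pure chance produces a lift this large.
p_value = np.mean(np.abs(lifts) >= observed_lift)
print(f"simulated p = {p_value:.3f}")
# This is P(data | no effect). "97% probability the new page is more
# effective" claims P(effect | data) -- a different quantity, which a
# frequentist test does not provide.
```

When AI flips the conditional like that, the numbers can be right while the sentence is still wrong.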
What's New
Last time, you integrated four data sources in four formats, designed a metric hierarchy with cascading definitions, and structured findings as a professional argument. You used cross-model review to verify analysis.
This time, three things arrive at once. First: experiment data. A/B tests support causal claims, not just descriptions. The statistical methods overlap with what you've used before, but the inferential stakes are higher. Second: you'll connect AI directly to a database via MCP for the first time. AI reading the schema instead of working from your descriptions changes what it can do. Third: the brief doesn't tell you what question to answer. "Is the new page better?" has multiple valid framings, and the one you choose determines what the test actually tests.
The hard part is not the statistics. It's figuring out what question to ask, discovering the confounds that change the interpretation, and communicating an honest answer when the data tells a more complicated story than Marco expects.
Tools
- Python 3.11+ (via Miniconda, "analytics" environment)
- DuckDB (continuing -- now also accessed via MCP)
- DuckDB MCP server (new -- first AI tool connection)
- Jupyter Notebook
- pandas
- scipy.stats
- matplotlib / seaborn
- Metabase (via Docker)
- Docker
- Claude Code (plan mode continuing)
- Git / GitHub
Materials
- A/B test dataset -- CSV export from Cumbre Adventures' booking platform. About 4,200 rows. Every visitor who saw either page version, what they booked, what they paid, how they arrived.
- Data dictionary -- describes the dataset columns and the experiment setup.
- Statistical testing template -- reporting format for experiment results: test setup, metric definition, results, confounds, recommendation.
- Verification targets -- known-good values for the overall conversion rate test. You compare AI's output against these.
- DuckDB MCP config -- configuration file for connecting AI to the database.
- CLAUDE.md -- project governance file with client context, work breakdown, and verification targets.