Learn by Directing AI
Unit 3

Running the experiment tests

Step 1: Choose the right test

The outcome variable is binary: each visitor either booked (true) or did not (false). The test compares proportions between two independent groups (page A visitors and page B visitors).

The statistical test type is not specified in the brief. You choose based on the data. For comparing two proportions from independent groups, the appropriate tests are a z-test for two proportions or a chi-squared test of independence. For a two-group comparison they are mathematically equivalent: the chi-squared statistic equals the square of the z-statistic, so both yield the same p-value.
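That equivalence is easy to demonstrate with a few lines of standard-library Python. A sketch, with all counts hypothetical (the real ones come from the dataset):

```python
import math

# Hypothetical counts: 180/1200 booked on page A, 240/1200 on page B
xa, na, xb, nb = 180, 1200, 240, 1200

# z-test for two proportions (pooled standard error)
pa, pb = xa / na, xb / nb
pool = (xa + xb) / (na + nb)
z = (pb - pa) / math.sqrt(pool * (1 - pool) * (1 / na + 1 / nb))

# Pearson chi-squared on the same 2x2 table (booked vs. not, A vs. B)
observed = [[xa, na - xa], [xb, nb - xb]]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)
chi2 = sum(
    (observed[i][j] - row_totals[i] * col_totals[j] / total) ** 2
    / (row_totals[i] * col_totals[j] / total)
    for i in range(2) for j in range(2)
)

print(f"z^2 = {z * z:.4f}, chi2 = {chi2:.4f}")  # identical for a 2x2 table
```

Whichever test AI runs, you can square the z-statistic and compare it to the chi-squared statistic as a quick cross-check.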

AI commonly defaults to the wrong test type for binary data. If you say "compare these two groups," AI may apply a t-test -- which is designed for continuous data, not proportions. The result will be a number, and it will look plausible, but it's the wrong test applied to the wrong data type. Specify the test explicitly.

Direct AI to run the overall conversion rate test:

Run a z-test for two proportions comparing the booking rate between page A and page B visitors. Use booking_completed as the outcome. Report: z-statistic, p-value (two-tailed), 95% confidence interval for the difference in proportions, and Cohen's h effect size.
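The whole computation fits in a few lines of standard-library Python, which gives you an independent check on whatever AI produces. A minimal sketch, with hypothetical counts standing in for the real dataset:

```python
import math

def two_proportion_ztest(booked_a, visitors_a, booked_b, visitors_b):
    """Two-sided z-test for the difference in booking rates (B minus A)."""
    pa, pb = booked_a / visitors_a, booked_b / visitors_b
    diff = pb - pa
    # Pooled proportion drives the standard error of the test statistic
    pool = (booked_a + booked_b) / (visitors_a + visitors_b)
    se_pool = math.sqrt(pool * (1 - pool) * (1 / visitors_a + 1 / visitors_b))
    z = diff / se_pool
    # Two-tailed p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    # Unpooled (Wald) standard error for the 95% CI of the difference
    se = math.sqrt(pa * (1 - pa) / visitors_a + pb * (1 - pb) / visitors_b)
    ci = (diff - 1.96 * se, diff + 1.96 * se)
    # Cohen's h: difference of arcsine-transformed proportions
    h = 2 * math.asin(math.sqrt(pb)) - 2 * math.asin(math.sqrt(pa))
    return z, p_value, ci, h

# Hypothetical counts -- substitute the real ones from the dataset
z, p, ci, h = two_proportion_ztest(180, 1200, 240, 1200)
print(f"z={z:.3f}, p={p:.4f}, 95% CI=({ci[0]:.4f}, {ci[1]:.4f}), h={h:.3f}")
```

Running your own version of the arithmetic, even once, makes it much easier to spot when AI has silently substituted a different test.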

Step 2: Verify against targets

Open materials/verification-targets.md. It contains known-good values for the overall test: the expected p-value, confidence interval, and effect size, computed from the actual dataset.

Compare AI's output against these targets. The p-value should match within rounding. The CI bounds should match within 0.1 percentage points. The effect size should match within 0.01.

If the numbers don't match, check three things: whether AI used the correct test type (z-test for proportions, not a t-test), whether AI applied the correct metric definition (count of booking_completed = true divided by total visitors, per page version), and whether AI included all rows without accidental filtering.
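The tolerance check itself can be scripted. A sketch with hypothetical numbers in place of both the targets (which live in materials/verification-targets.md) and AI's output:

```python
# Hypothetical values -- the real targets come from materials/verification-targets.md
targets = {"p_value": 0.0013, "ci_low": 0.020, "ci_high": 0.080, "cohens_h": 0.132}
ai_output = {"p_value": 0.0013, "ci_low": 0.0197, "ci_high": 0.0803, "cohens_h": 0.1319}

# Tolerances from the brief: rounding for p, 0.1 pp for CI bounds, 0.01 for effect size
tolerances = {"p_value": 0.0005, "ci_low": 0.001, "ci_high": 0.001, "cohens_h": 0.01}

mismatches = [
    name for name, target in targets.items()
    if abs(ai_output[name] - target) > tolerances[name]
]
print("all targets matched" if not mismatches else f"check these: {mismatches}")
```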

This verification against provided targets is practice. In future projects, you won't have targets. The habit of checking AI's statistical output starts here, when you can confirm the right answer.

Step 3: Catch the p-value framing

Read AI's interpretation of the result carefully. AI computes p-values correctly but often explains them incorrectly.

The correct interpretation: "If there were no real difference between the pages, the probability of observing a difference this large or larger is [p-value]. Since this probability is very small, we reject the null hypothesis of no difference."

AI may instead say something like "there is a 97% probability that the new page is more effective" or "we can be 97% confident the treatment works." This is wrong. A p-value is not the probability that the hypothesis is true. It is the probability of seeing data at least as extreme as what was observed, assuming there is no real difference. These sound similar but mean fundamentally different things.
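A short simulation makes the correct definition concrete: build a world where the pages are truly identical, and count how often chance alone produces a gap as large as the observed one. All numbers here are hypothetical:

```python
import random

random.seed(0)  # reproducible sketch

# Hypothetical: both pages truly share one booking rate, and the real data
# showed a 3-percentage-point gap between them
true_rate, n_per_page, observed_diff = 0.15, 1200, 0.03

trials, extreme = 2000, 0
for _ in range(trials):
    rate_a = sum(random.random() < true_rate for _ in range(n_per_page)) / n_per_page
    rate_b = sum(random.random() < true_rate for _ in range(n_per_page)) / n_per_page
    # Count how often a no-difference world produces a gap at least this large
    if abs(rate_b - rate_a) >= observed_diff:
        extreme += 1

simulated_p = extreme / trials
print(f"simulated p ≈ {simulated_p:.3f}")
```

The fraction printed is what the p-value measures: a probability about the data under the no-difference assumption, not a probability that either hypothesis is true.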

If AI frames it correctly, good. If it doesn't, correct the interpretation explicitly. This matters because the framing shapes the recommendation to Marco. "97% probability the new page is better" sounds conclusive. "The observed difference would be very unlikely if the pages were equal" is accurate and appropriately cautious.

Step 4: Per-tour-type tests

Run the same test for each tour type separately:

Run z-tests for two proportions comparing page A vs page B booking rates for each tour type separately: death_road, premium treks (huayna_potosi and cordillera_real combined), and paragliding. Report the same metrics for each: z-statistic, p-value, 95% CI, and Cohen's h.
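The segmented runs are the same test in a loop. A self-contained sketch, with all per-segment counts hypothetical:

```python
import math

def z_and_p(booked_a, n_a, booked_b, n_b):
    """z-statistic (B minus A) and two-tailed p-value for two proportions."""
    pa, pb = booked_a / n_a, booked_b / n_b
    pool = (booked_a + booked_b) / (n_a + n_b)
    se = math.sqrt(pool * (1 - pool) * (1 / n_a + 1 / n_b))
    z = (pb - pa) / se
    return z, 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Hypothetical per-segment counts: (bookings_A, visitors_A, bookings_B, visitors_B)
segments = {
    "death_road": (180, 1200, 240, 1200),
    "premium_treks": (90, 600, 55, 600),   # huayna_potosi + cordillera_real
    "paragliding": (60, 500, 63, 500),
}

results = {name: z_and_p(*counts) for name, counts in segments.items()}
for name, (z, p) in results.items():
    print(f"{name}: z={z:+.2f}, p={p:.4f}")
```

Note that the sign of z carries the direction: positive means page B converts better for that segment, negative means worse.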

The results will tell different stories for different tour types. Death Road bookings should show a significant positive effect for page B. Premium treks should show a significant negative effect. Paragliding should show no significant difference.

This is why segmented analysis matters. The overall test says "the new page is better." The per-tour-type test says "better for standard tours, worse for premium treks." Marco's web developer and operations manager are both right -- they are just looking at different slices of the same data.

Step 5: Effect size

The p-value tells you whether the difference is statistically significant. Effect size tells you whether the difference is practically significant.

A confidence interval communicates both the direction and the plausible magnitude of the effect. "The conversion rate changed by between 0.8 and 3.2 percentage points (95% CI)" says more than "p = 0.03." The p-value says something happened. The confidence interval says how much.

Direct AI to explain what the effect size means in practical terms for Marco's business. A 3 percentage point increase in Death Road bookings, applied to his traffic volume, translates to a specific number of additional bookings per month. A 7 percentage point decrease in premium trek bookings translates to lost revenue at the higher price point. Both numbers matter for the recommendation.
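Translating effect sizes into business terms is plain arithmetic, so it's worth doing yourself as a check on AI's version. Traffic volumes, prices, and effect sizes below are all hypothetical placeholders:

```python
# Hypothetical traffic and pricing -- substitute Marco's real figures
death_road_visitors = 1200   # visitors/month to Death Road pages
premium_visitors = 600       # visitors/month to premium trek pages
premium_price = 550          # USD per premium trek booking, hypothetical

uplift_pp = 0.03   # +3 pp conversion on Death Road (hypothetical effect)
drop_pp = 0.07     # -7 pp conversion on premium treks (hypothetical effect)

extra_bookings = uplift_pp * death_road_visitors
lost_bookings = drop_pp * premium_visitors
lost_revenue = lost_bookings * premium_price

print(f"≈{extra_bookings:.0f} extra Death Road bookings/month")
print(f"≈{lost_bookings:.0f} fewer premium bookings/month (≈${lost_revenue:,.0f}/month)")
```

Under these placeholder numbers the premium-trek loss dwarfs the percentage points alone would suggest, because it compounds with the higher price point.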

AI tends to lead with the p-value and bury the effect size. Check that the output gives both equal weight.

Step 6: Meta-prompting for validity

You've computed the tests and verified the numbers. Now check what might make the results misleading. This is meta-prompting -- using AI to help design verification for analytical territory you haven't navigated before.

I've computed A/B test results comparing two booking page versions. The overall test shows a significant positive effect for the new page, but premium trek bookings show a significant negative effect. Help me figure out what could make these test results invalid or misleading. What threats to validity should I consider?

AI should surface several potential threats: changes in traffic composition during the test period, differences in how the two page versions display information, whether the sample size for premium treks is large enough to support the conclusion, and whether other variables changed alongside the page redesign.
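The sample-size threat is the one you can quantify immediately. One rough approach is a power calculation via Cohen's h and the arcsine approximation; the rates and group size below are hypothetical stand-ins for the premium-trek segment:

```python
import math
from statistics import NormalDist

def approx_power(p_a, p_b, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test,
    using Cohen's h and the arcsine approximation."""
    h = abs(2 * math.asin(math.sqrt(p_a)) - 2 * math.asin(math.sqrt(p_b)))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(h * math.sqrt(n_per_group / 2) - z_crit)

# Hypothetical premium-trek rates and per-group sample size
power = approx_power(0.15, 0.092, 600)
print(f"approximate power ≈ {power:.2f}")
```

If the computed power for the premium-trek segment is low, a "significant" negative effect there deserves extra skepticism; if it's high, the segment's sample size is not the weak point.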

Some of these you already know about (the pricing display confound from Unit 2). Others are new leads. The traffic composition question leads directly to the next unit, where you connect AI to the database and discover what happened to Marco's ad budget.

✓ Check

The overall test p-value should match the verification target (within rounding). The confidence interval should match. AI's initial p-value interpretation should be checked: if it frames the p-value as the probability that the hypothesis is true, the student catches it.