Learn by Directing AI
Unit 5

Accounting for the confounds

Step 1: Segment by visitor source

The ad budget shift is a confound: paid ad traffic increased in week 3, and paid ad visitors may behave differently from organic visitors. To isolate the page design effect from the traffic composition effect, filter the data to organic visitors only.

Use the MCP-connected DuckDB to run the segmented analysis:

Using the DuckDB connection, compute the overall booking rate for page A and page B for organic visitors only (visitor_source = 'organic'). Then compute the difference and compare it to the full-sample difference.

The organic-only conversion rate difference should be smaller than the overall difference. Removing the paid ad traffic removes the confound -- what remains is closer to the actual page design effect. The difference between the full-sample number and the organic-only number is the portion of the apparent effect that was driven by the traffic composition change, not the page redesign.
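The organic-only prompt above maps to a short GROUP BY query. The sketch below is runnable as-is: it uses Python's stdlib sqlite3 as a stand-in for the MCP DuckDB connection (the SQL is standard and runs unchanged on DuckDB), and the column names `page_version`, `visitor_source`, and `booked` are assumptions about the experiment table's schema, not confirmed names.

```python
import sqlite3

# sqlite3 stands in for the MCP-connected DuckDB here; a few synthetic rows
# make the query runnable. Schema (page_version, visitor_source, booked)
# is an assumed shape for the experiment table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE visits (page_version TEXT, visitor_source TEXT, booked INTEGER)"
)
conn.executemany("INSERT INTO visits VALUES (?, ?, ?)", [
    ("A", "organic", 1), ("A", "organic", 0), ("A", "paid_ad", 0),
    ("B", "organic", 1), ("B", "organic", 1), ("B", "paid_ad", 1),
])

sql = """
SELECT page_version,
       AVG(booked) AS booking_rate   -- conversion rate per page version
FROM visits
WHERE visitor_source = 'organic'     -- drop the paid-ad traffic (the confound)
GROUP BY page_version
ORDER BY page_version
"""
for page, rate in conn.execute(sql):
    print(page, round(rate, 3))
```

Running the same query without the WHERE clause gives the full-sample rates; the gap between the two differences is the portion attributable to traffic composition.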

Step 2: First two weeks check

As a robustness check, filter to the first two weeks of the experiment only -- before the ad budget shifted.

Using the DuckDB connection, compute booking rates by page version for visitors in the first 14 days of the experiment only. Compare to the full-period results.

The pattern should be similar to the full period but with wider confidence intervals due to the smaller sample. If the first-two-weeks results show a similar direction but are noisier, that's consistent with a real but smaller effect being amplified by the traffic composition change.
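The first-two-weeks filter is a one-line WHERE clause. As before, this sketch uses sqlite3 so it runs standalone; the `visit_date` column and the experiment start date are assumptions. On DuckDB the equivalent filter could be written `visit_date < DATE '2024-03-01' + INTERVAL 14 DAY`.

```python
import sqlite3

# sqlite3 stand-in for the DuckDB session; visit_date stored as ISO text.
# The start date 2024-03-01 is illustrative, not from the actual dataset.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE visits (page_version TEXT, visit_date TEXT, booked INTEGER)"
)
conn.executemany("INSERT INTO visits VALUES (?, ?, ?)", [
    ("A", "2024-03-03", 0), ("A", "2024-03-20", 1),  # second row falls after day 14
    ("B", "2024-03-05", 1), ("B", "2024-03-25", 0),
])

sql = """
SELECT page_version, AVG(booked) AS booking_rate, COUNT(*) AS n
FROM visits
WHERE visit_date < date('2024-03-01', '+14 days')  -- before the ad budget shift
GROUP BY page_version
ORDER BY page_version
"""
for page, rate, n in conn.execute(sql):
    print(page, rate, n)
```

The COUNT(*) column matters here: the smaller n is what widens the confidence intervals relative to the full period.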

When two different segmentation approaches (organic-only and first-two-weeks) point in the same direction, the conclusion is strengthened. If they pointed in different directions, you would need to investigate further.

Step 3: The language limitation

Ask Marco about which visitors were included in the A/B test. Not which tour types or which visitor sources -- which language versions of the site.

Marco has to think about it: "Huh, you know, I think the test was only on the English page. We have the Spanish and French pages too, but those weren't part of it."

This is a third confound -- but a different kind. The ad budget shift and pricing display difference threaten the test's internal validity (is the observed effect actually caused by the page design?). The language limitation threatens the test's external validity (can we generalize the results to all of Marco's visitors?). Only English-language visitors were tested. Spanish and French visitors all saw the old page.

The test results apply to English-language visitors only. Any recommendation for the Spanish and French pages requires a separate test.

Step 4: Premium trek sample size

Look at how many visitors saw premium trek options on each page version. With roughly 4,200 total visitors split 50/50, and premium treks representing a smaller fraction of total interest, the per-version sample for premium treks may be small.

Using the DuckDB connection, count the total visitors who could have booked a premium trek (huayna_potosi or cordillera_real) on each page version. How large is the sample for each group?

With about 180 visitors per page version across the two premium treks combined, the test has limited statistical power for this subgroup. A meaningful difference might exist but the sample is too small to detect it reliably. This does not mean the premium trek decline is not real -- it means the test cannot prove it with confidence.
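The count behind that statement comes from a query like the one below, again with sqlite3 standing in for the DuckDB connection. The `tour_interest` column name is an assumption; the tour identifiers (`huayna_potosi`, `cordillera_real`) come from the prompt above.

```python
import sqlite3

# Runnable stand-in for the DuckDB session with a handful of synthetic rows.
# tour_interest is an assumed column name for which tour a visitor viewed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (page_version TEXT, tour_interest TEXT)")
conn.executemany("INSERT INTO visits VALUES (?, ?)", [
    ("A", "huayna_potosi"), ("A", "cordillera_real"), ("A", "day_hike"),
    ("B", "huayna_potosi"), ("B", "day_hike"),
])

sql = """
SELECT page_version, COUNT(*) AS premium_visitors
FROM visits
WHERE tour_interest IN ('huayna_potosi', 'cordillera_real')
GROUP BY page_version
ORDER BY page_version
"""
print(conn.execute(sql).fetchall())  # small per-version counts flag low power
```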

Direct AI to discuss what sample size Marco would need to detect a meaningful difference in premium trek bookings. This is a forward-looking recommendation, not a verdict on the current data.
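As a sense of what that discussion involves (an illustration, not the answer AI would give for Marco's actual rates): the standard normal-approximation sample-size formula for comparing two proportions can be computed with Python's stdlib. The 10% and 5% rates below are made-up numbers, not values from the dataset.

```python
from statistics import NormalDist

def required_n_per_group(p1, p2, alpha=0.05, power=0.8):
    """Approximate per-group n for a two-sided two-proportion z-test
    (normal approximation). p1, p2 are the rates to distinguish."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value for the test
    z_beta = z.inv_cdf(power)            # quantile for the desired power
    p_bar = (p1 + p2) / 2
    num = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
           + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return num / (p1 - p2) ** 2

# Illustrative rates only -- the real baseline comes from the data.
print(round(required_n_per_group(0.10, 0.05)))  # hundreds per version, not ~180
```

The point of the exercise: detecting a halving of a 10% booking rate needs several hundred premium-trek visitors per version, well beyond the roughly 180 available.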

Step 5: Synthesize the confounds

You have identified three confounds. Each changes the interpretation differently:

  1. Ad budget shift (week 3) -- threatens internal validity. Paid ad visitors increased mid-experiment, inflating the apparent page B advantage. When isolated to organic visitors, the effect shrinks.
  2. Pricing display difference -- likely explains the premium trek decline. The new page hides group discounts that make premium treks attractive. This is a page design issue, but it is specific to the pricing layout, not the overall redesign.
  3. Language limitation -- limits generalizability. Results apply only to English-language visitors. Spanish and French visitors were not tested.

Each confound operates independently. The ad budget shift inflated the overall effect. The pricing display drove the premium trek decline. The language limitation constrains who the results apply to.

The honest summary: the new page probably does improve overall bookings for English-language visitors, but the effect is smaller than the headline number suggests. The premium trek decline is likely caused by the pricing display, not the page design itself.

Step 6: Verify the confound analysis

Use meta-prompting one more time. You've done the segmented analysis and catalogued the confounds. Check your work:

Here's my analysis of the confounding factors in this A/B test: (1) an ad budget increase mid-test changed the traffic composition, (2) a pricing display difference between page versions explains the premium trek decline, (3) the test only covered English-language visitors. What am I missing? What other threats to validity should I consider?

AI may surface additional threats: seasonal effects during the 60-day window, differences in visitor behavior by day of week, whether the randomization was truly random, or whether repeat visitors were counted multiple times. Some of these you can check with the data. Others are limitations you acknowledge in the report.

The verification extends to the interfaces between different parts of the analysis. Does the metric definition you chose in Unit 2 match what the tests computed in Unit 3? Does the organic-only segmentation use the same metric definition? When different parts of an analysis are handled in different sessions or with different tools, the connections between them are where errors hide.

✓ Check

The organic-only conversion rate difference should be smaller than the overall difference (removing the ad traffic effect). The first-two-weeks analysis should show a similar but noisier pattern.