Learn by Directing AI
Unit 3

Running the statistical tests

Step 1: The seasonal baseline

The hypothesis test needs two numbers to compare: the rate before the campaign and the rate during the campaign. Start with the baseline.

The seasonal baseline is Oct-Dec of the prior year -- the same months, one year earlier. This controls for the Q4 spike that Wei already knows about. Direct AI to compute it:

Calculate the new patient booking rate (new bookings / total bookings) for Oct-Dec of year 1 across the five original clinics. Exclude Gaoxin entirely.

The result is a proportion -- something like 0.34 or 0.36. This is the rate Wei would expect to see in Q4 without any campaign.
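The baseline computation can be sketched in pandas. The DataFrame schema below is hypothetical -- the column names (`clinic`, `booking_date`, `is_new_patient`) and the non-Gaoxin clinic names are placeholders to adjust against the actual dataset:

```python
import pandas as pd

# Hypothetical schema -- adjust column and clinic names to the real data.
bookings = pd.DataFrame({
    "clinic": ["ClinicA", "ClinicA", "ClinicB", "Gaoxin", "ClinicB"],
    "booking_date": pd.to_datetime(
        ["2022-10-05", "2022-11-12", "2022-12-01", "2022-10-20", "2022-11-30"]),
    "is_new_patient": [True, False, True, True, False],
})

def new_patient_rate(df, year, months=(10, 11, 12), exclude=("Gaoxin",)):
    """New bookings / total bookings for the given months and year."""
    mask = (
        df["booking_date"].dt.year.eq(year)
        & df["booking_date"].dt.month.isin(months)
        & ~df["clinic"].isin(exclude)
    )
    # Mean of a boolean column is the proportion of True values.
    return df.loc[mask, "is_new_patient"].mean()

baseline = new_patient_rate(bookings, year=2022)
print(baseline)  # 2 new out of 4 included bookings -> 0.5
```

The same function can be reused in Step 2 by passing the campaign year.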

Step 2: The campaign-period rate

Now compute the same proportion for the campaign period:

Calculate the new patient booking rate (new bookings / total bookings) for Oct-Dec of year 2 across the five original clinics. Exclude Gaoxin.

Compare the two rates. The campaign-period rate should be higher. The question is whether it is higher by enough to rule out normal variation.

Step 3: Running the z-test

Open materials/statistical-testing-guide.md and find the "Running Tests in Python" section. The code uses proportions_ztest from statsmodels.

Direct AI to run the test. Be explicit about the constraint -- AI commonly defaults to a t-test whenever it sees "compare two groups," regardless of whether the data is binary or continuous. Specify the test and the reason:

Run a z-test for proportions using proportions_ztest from statsmodels. Compare the new patient booking rate in Oct-Dec year 2 (excluding Gaoxin) against Oct-Dec year 1. Use alternative='two-sided'. Do not use a t-test -- the outcome is a proportion (binary: new or returning), not a continuous measurement.


The output gives two numbers: a z-statistic and a p-value. The z-statistic measures how far the campaign-period rate is from the baseline rate, in units of standard error. The p-value is the probability of seeing a difference at least this large if the campaign had no effect.
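The test itself is a few lines. The counts below are illustrative -- substitute the real aggregates (new-patient bookings and total bookings for each period, Gaoxin excluded):

```python
from statsmodels.stats.proportion import proportions_ztest

# Illustrative counts -- replace with the real aggregates from Steps 1 and 2.
new_campaign, total_campaign = 400, 1000   # Oct-Dec year 2, excluding Gaoxin
new_baseline, total_baseline = 340, 1000   # Oct-Dec year 1, excluding Gaoxin

# Two-sample z-test for proportions; the null is "no difference in rates".
z_stat, p_value = proportions_ztest(
    count=[new_campaign, new_baseline],
    nobs=[total_campaign, total_baseline],
    alternative="two-sided",
)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
```

With these made-up counts (0.40 vs 0.34), the z-statistic comes out around 2.8 -- nearly three standard errors above the baseline.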

Step 4: Interpreting the p-value

The p-value is not a verdict. It is a probability.

If the p-value is 0.02, that means: if the campaign had no effect at all, you would see a difference at least this large about 2% of the time. That is unlikely enough to suggest the campaign did something. But it is not proof -- 2%-chance events do happen.

If the p-value is below 0.05, the result is conventionally called "statistically significant." If it is above 0.05, the result is "not statistically significant." But a p-value of 0.04 and a p-value of 0.06 represent similar evidence. The threshold is a reporting convention, not a line between truth and falsehood.

Check what AI reports. AI tends to treat the 0.05 threshold as mechanical -- "significant" or "not significant" with no nuance. If the p-value is close to the boundary, the honest interpretation acknowledges the ambiguity.

Step 5: The confidence interval

A p-value tells you whether an effect is likely present. A confidence interval tells you how large it might be. Wei needs both.

Direct AI to compute the confidence interval for the difference in proportions:

Compute the 95% confidence interval for the difference in new patient booking rates between the campaign period and the baseline period. Use confint_proportions_2indep from statsmodels with method='wald'.

The output is a range -- something like "the campaign increased the new patient rate by 3 to 12 percentage points (95% CI)." That range is the honest version of Wei's "22% increase." It says: the true effect is probably somewhere in this interval. The point estimate is the midpoint, but the range captures the uncertainty.

This is the transition from descriptive to inferential. "Bookings increased 22%" is an observation -- it describes the sample. "The increase was between X% and Y% with 95% confidence" is a claim about the underlying effect, not just the numbers in the dataset.

Step 6: Channel-level tests

Wei wants to know which channels worked. Break the analysis down by booking source.

The campaign had three channels: WeChat ads, KOL partnerships, and the referral bonus. Each can be tested separately -- is the proportion of bookings from that source higher than expected?

For each campaign channel (wechat_ad, kol, referral), compute the new patient booking rate from that channel during Oct-Dec year 2. Compare each to the overall new patient rate during the baseline period. Run a separate z-test for proportions for each channel.

Watch the referral channel closely. The referral source tag appears in both years -- organic referrals existed before the campaign. The campaign added a CNY 200 bonus, but the data cannot distinguish bonus-driven referrals from organic ones. That attribution gap muddies the referral test: its baseline rate is only approximate, so the result measures the combined change in referrals, not the bonus alone.

The WeChat ad and KOL channels are cleaner -- those sources did not exist before the campaign, so any booking tagged to them is campaign-attributed. But some patients who saw a WeChat ad may have booked by phone and been tagged as "walk-in" instead. The campaign's true effect is likely understated.

About 3% of bookings have null booking_source values. Those are lost observations -- they reduce the effective sample size for channel-level tests. A test that would detect a real effect with complete data might miss it with 3% of observations unattributable.
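The per-channel loop can be sketched as below. The channel counts are made up for illustration; the real numbers come from grouping the data by `booking_source` (rows with null `booking_source` simply drop out of these aggregates):

```python
from statsmodels.stats.proportion import proportions_ztest

# Illustrative per-channel aggregates -- replace with real counts.
# Each entry: (new-patient bookings from the channel, total bookings from it).
channels = {
    "wechat_ad": (120, 250),
    "kol": (80, 180),
    "referral": (95, 260),  # mixes organic and bonus-driven referrals
}
baseline_new, baseline_total = 340, 1000  # overall baseline, Oct-Dec year 1

results = {}
for name, (new, total) in channels.items():
    z, p = proportions_ztest([new, baseline_new], [total, baseline_total])
    results[name] = (z, p)
    print(f"{name}: z = {z:.2f}, p = {p:.4f}")
```

Three separate tests means three chances for a false positive, so treat any single borderline p-value here with extra caution.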

Step 7: Service category chi-squared test

Wei's campaign promoted all services, but the effect might concentrate in specific categories. This is a question about whether two categorical variables -- campaign period (yes/no) and service category (general, cosmetic, orthodontics, implant) -- are related.

The chi-squared test is the right tool for this. Direct AI to build a contingency table and run the test:

Build a contingency table: rows are campaign period (Oct-Dec year 2) vs baseline (Oct-Dec year 1), columns are service categories. Count new patient bookings only. Exclude Gaoxin. Run a chi-squared test using chi2_contingency from scipy.stats.

The chi-squared result tells you whether the distribution of new patients across service categories changed during the campaign. If it did, follow up: which categories drove the change? The cosmetic dentistry category should show the largest shift.
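The table-plus-test step looks like this. The counts are invented placeholders -- the real table comes from cross-tabulating period against service category for new-patient bookings:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Illustrative contingency table -- replace with real new-patient counts.
# Rows: baseline (Oct-Dec year 1), campaign (Oct-Dec year 2).
# Columns: general, cosmetic, orthodontics, implant.
observed = np.array([
    [200,  80, 60, 40],   # baseline
    [210, 150, 70, 50],   # campaign
])

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}, dof = {dof}")
```

Comparing `observed` against `expected` cell by cell shows which categories drove a significant result -- in this made-up table, the cosmetic column contributes most of the chi-squared statistic.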

Step 8: Cross-checking the results

AI produces statistical output that looks authoritative. The numbers are precise. The formatting is clean. But the test selection and interpretation are judgment calls, and AI gets these wrong in predictable ways.

Open a fresh Claude Code session. Paste the key context -- what the data is, what the hypothesis is, what test was run. Then ask it to review:

I ran a z-test for proportions comparing new patient booking rates between Oct-Dec year 2 (campaign period, excluding Gaoxin clinic) and Oct-Dec year 1 (baseline). The outcome is binary: a booking is either from a new patient or not. Review whether this test selection is correct for this data type and hypothesis. Also review whether a two-sided test was appropriate.

The fresh session should confirm the z-test is appropriate for proportion data. If it suggests a t-test instead, that is the pattern -- AI defaulting to "compare two groups means t-test." You now know to check this every time.

✓ Check

Check: You should have a p-value for the primary hypothesis test, a confidence interval for the campaign effect, and a breakdown showing which channels and service categories showed statistically significant effects.