Statistical Testing Guide
Introduction to Hypothesis Testing
A hypothesis test asks one question: could this pattern be noise?
When Wei says "bookings increased 22%," that is an observation. A hypothesis test asks whether the increase is large enough that chance alone becomes an unlikely explanation. Bookings vary naturally -- some months are higher, some lower, even without any campaign. The test tells you whether the observed increase is bigger than what normal variation would produce.
The structure of a test
Every hypothesis test has two competing explanations:
- Null hypothesis (H0): The campaign had no effect. The increase in bookings is consistent with normal seasonal variation.
- Alternative hypothesis (H1): The campaign did have an effect. The increase in bookings is larger than what seasonal variation alone would produce.
The test computes a p-value -- the probability of seeing an increase at least this large if the null hypothesis were true. If the p-value is small, the data is hard to explain without a campaign effect. If the p-value is large, the data is consistent with normal variation.
P-values are probabilities, not verdicts
A p-value of 0.03 means: if there were no campaign effect, you would see an increase this large about 3% of the time. That is unlikely enough to suggest a real effect -- but it is not proof.
A p-value of 0.08 means: if there were no campaign effect, you would see an increase this large about 8% of the time. That is suggestive but not conclusive.
The conventional threshold is alpha = 0.05 -- if the p-value is below 0.05, the result is called "statistically significant." This threshold is a convention, not a law. A p-value of 0.04 and a p-value of 0.06 represent similar evidence. The threshold helps standardize reporting, not thinking.
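To make the p-value concrete, here is a small simulation with made-up numbers (a monthly baseline of 500 bookings with month-to-month SD of 60, and an observed month of 610 -- all hypothetical). The p-value is simply the share of simulated no-effect months that look at least as extreme as the one observed:

```python
import numpy as np

# Hypothetical numbers for illustration only: baseline bookings average
# 500/month with natural variation (SD 60); the observed month had 610.
rng = np.random.default_rng(42)
baseline_mean, baseline_sd, observed = 500, 60, 610

# Simulate 100,000 "no-effect" months and count how often normal variation
# alone produces a month at least as high as the observed one.
simulated = rng.normal(baseline_mean, baseline_sd, size=100_000)
p_value = (simulated >= observed).mean()
print(f"Simulated one-sided p-value: {p_value:.3f}")
```

The simulated value lands near 0.03: even with no campaign at all, roughly 3 months in 100 would look this good, which is exactly what a p-value of ~0.03 from a formal test is saying.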
What "not significant" means
"Not statistically significant" does NOT mean "no effect." It means the data does not provide strong enough evidence to conclude there was an effect. This could be because:
- There truly was no effect
- The effect was real but small, and the dataset was not large enough to detect it
- The data had quality issues that reduced the test's ability to detect an effect
The honest statement is "insufficient evidence to conclude an effect" -- not "the campaign didn't work."
Test Selection Decision Tree
The choice of test depends on the type of data you are testing.
Is your outcome variable continuous (measurements, revenue amounts)?
- Use a t-test to compare means between two groups
- Example: "Is average revenue per patient higher during the campaign period?"
Is your outcome variable binary or a proportion (yes/no, rates)?
- Use a z-test for proportions to compare rates between two groups
- Example: "Is the proportion of new patients (out of total bookings) higher during the campaign period than during the same period last year?"
- This is the correct test for Wei's primary question -- a booking is either campaign-sourced or not
Are you testing whether two categorical variables are associated?
- Use a chi-squared test to test for association
- Example: "Is the campaign effect associated with specific service categories?"
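The decision tree above can be encoded as a lookup, which is a convenient guardrail in an analysis script (a sketch; `choose_test` is a hypothetical helper, not a library function):

```python
def choose_test(outcome_type: str) -> str:
    """Map an outcome type to a test, following the decision tree above."""
    tests = {
        "continuous": "t-test (compare means between two groups)",
        "proportion": "z-test for proportions (compare rates between two groups)",
        "categorical": "chi-squared test (association between categorical variables)",
    }
    try:
        return tests[outcome_type]
    except KeyError:
        raise ValueError(f"unknown outcome type: {outcome_type!r}")

print(choose_test("proportion"))
```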
Why test selection matters
AI defaults to a t-test whenever it sees "compare two groups" -- regardless of whether the data is continuous, binary, or categorical. Applying a t-test to binary outcome data (proportions) computes the p-value under the wrong model: the gap from the z-test for proportions grows with small samples and extreme proportions, and near the 0.05 threshold it can flip the conclusion from "significant" to "not significant."
For Wei's analysis: the outcome is whether a booking came from a campaign channel or not. That is binary -- a proportion. The z-test for proportions is the correct choice.
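The mismatch is easy to see side by side. The counts below are deliberately tiny, made-up toy numbers (9 of 10 campaign bookings vs. 3 of 10 baseline bookings are new patients) so the two p-values diverge visibly -- with samples this small neither test is ideal in practice, but the point is that the wrong model gives a different answer:

```python
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical toy data: 1 = new patient, 0 = returning patient
campaign = np.array([1] * 9 + [0] * 1)
baseline = np.array([1] * 3 + [0] * 7)

# Wrong tool: a t-test on raw 0/1 outcomes treats them as continuous data
t_stat, t_p = ttest_ind(campaign, baseline)

# Right tool: a z-test for the difference between two proportions
z_stat, z_p = proportions_ztest([9, 3], [10, 10])

print(f"t-test p-value: {t_p:.4f}")
print(f"z-test p-value: {z_p:.4f}")
```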
Running Tests in Python
Z-test for proportions (statsmodels)
```python
from statsmodels.stats.proportion import proportions_ztest

count = [campaign_new_patients, baseline_new_patients]
nobs = [campaign_total_bookings, baseline_total_bookings]

z_stat, p_value = proportions_ztest(count, nobs, alternative='two-sided')
print(f"Z-statistic: {z_stat:.4f}")
print(f"P-value: {p_value:.4f}")
```
Confidence interval for difference in proportions
```python
from statsmodels.stats.proportion import confint_proportions_2indep

ci_low, ci_high = confint_proportions_2indep(
    count1=campaign_new_patients, nobs1=campaign_total_bookings,
    count2=baseline_new_patients, nobs2=baseline_total_bookings,
    method='wald',
)
print(f"95% CI for difference: [{ci_low:.4f}, {ci_high:.4f}]")
```
Chi-squared test (scipy.stats)
```python
from scipy.stats import chi2_contingency

# Rows: period (campaign, baseline); columns: service category
contingency_table = [[campaign_cosmetic, campaign_general],
                     [baseline_cosmetic, baseline_general]]

chi2, p_value, dof, expected = chi2_contingency(contingency_table)
print(f"Chi-squared: {chi2:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Degrees of freedom: {dof}")
```
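The snippets above assume the count variables are already defined. Here is a self-contained run of the z-test plus its confidence interval on hypothetical counts (130 new patients out of 480 campaign bookings vs. 95 out of 450 baseline bookings -- illustration only):

```python
from statsmodels.stats.proportion import (
    confint_proportions_2indep,
    proportions_ztest,
)

# Hypothetical counts for illustration only
campaign_new_patients, campaign_total_bookings = 130, 480
baseline_new_patients, baseline_total_bookings = 95, 450

z_stat, p_value = proportions_ztest(
    [campaign_new_patients, baseline_new_patients],
    [campaign_total_bookings, baseline_total_bookings],
)
ci_low, ci_high = confint_proportions_2indep(
    count1=campaign_new_patients, nobs1=campaign_total_bookings,
    count2=baseline_new_patients, nobs2=baseline_total_bookings,
    method='wald',
)
print(f"p = {p_value:.4f}, 95% CI for difference = [{ci_low:.4f}, {ci_high:.4f}]")
```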
Interpreting Results
Reading the output
- Z-statistic / Chi-squared statistic: How far the observed data falls from what the null hypothesis predicts, measured relative to sampling variation. Larger values mean the observed data is harder to reconcile with the null hypothesis.
- P-value: The probability of seeing data this extreme if the null hypothesis were true. Compare to alpha (0.05 by convention).
- Confidence interval: The range of plausible values for the true effect. "The campaign increased new patient proportion by 3-12 percentage points (95% CI)" means: if you repeated this analysis many times, 95% of the intervals would contain the true effect.
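The "95% of intervals contain the true effect" interpretation can be checked directly by simulation (a sketch with assumed true proportions of 0.27 and 0.21 and a hand-rolled Wald interval -- not tied to any particular library call):

```python
import numpy as np

# Simulate many repeats of the same study and count how often the 95%
# interval for the difference in proportions contains the true difference.
rng = np.random.default_rng(0)
true_p_campaign, true_p_baseline = 0.27, 0.21   # assumed true values
n1, n2, n_repeats = 500, 500, 2000
true_diff = true_p_campaign - true_p_baseline

covered = 0
for _ in range(n_repeats):
    x1 = rng.binomial(n1, true_p_campaign)
    x2 = rng.binomial(n2, true_p_baseline)
    p1, p2 = x1 / n1, x2 / n2
    se = (p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2) ** 0.5
    lo, hi = (p1 - p2) - 1.96 * se, (p1 - p2) + 1.96 * se
    if lo <= true_diff <= hi:
        covered += 1

print(f"Coverage: {covered / n_repeats:.1%}")  # close to 95%
```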
Common interpretation mistakes
- Treating p=0.049 as categorically different from p=0.051 -- they represent nearly identical evidence
- Reporting "the campaign didn't work" when the test is not significant -- the correct statement is "the data does not provide strong evidence of a campaign effect"
- Ignoring the confidence interval and reporting only the point estimate -- "22% increase" is less honest than "between 8% and 36% increase (95% CI)"
- Running multiple tests without acknowledging that more tests means more chances for false positives
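The multiple-testing point in the last bullet has a standard remedy: adjust the p-values for the number of tests. A sketch using statsmodels' `multipletests` with Holm's method, on hypothetical p-values from testing five service categories separately:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from five separate category-level tests
p_values = [0.012, 0.034, 0.049, 0.210, 0.630]

# Holm's method controls the family-wise error rate across all five tests
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method='holm')
for raw, adj, sig in zip(p_values, p_adjusted, reject):
    print(f"raw p = {raw:.3f}  adjusted p = {adj:.3f}  significant: {sig}")
```

Note how three raw p-values sit below 0.05, but after adjustment none survive: with five chances to find a false positive, individually "significant" results carry less weight.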
Reporting with Confidence Intervals
The format
Every statistical finding should include:
- The point estimate (what you measured)
- The confidence interval (the range of plausible values)
- The p-value (how unlikely this result is under the null hypothesis)
- A plain-language interpretation
Examples
Instead of: "New patient bookings increased 22%."
Write: "New patient bookings increased by an estimated 22%, with a 95% confidence interval of [8%, 36%]. This increase is statistically significant (p = 0.019), meaning it is unlikely to be explained by seasonal variation alone."
Instead of: "The campaign had no effect on general dentistry."
Write: "The increase in general dentistry bookings (8%) was not statistically significant (p = 0.34, 95% CI: [-6%, 22%]). The data does not provide sufficient evidence to conclude the campaign affected general dentistry bookings -- the observed increase is consistent with normal seasonal variation."
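The four-part reporting format can be enforced with a small helper so every finding comes out in the same shape (a sketch; `report_finding` is a hypothetical function, not part of any library):

```python
def report_finding(estimate, ci_low, ci_high, p_value, alpha=0.05):
    """Format a finding: point estimate, 95% CI, p-value, plain-language read."""
    verdict = (
        "statistically significant; unlikely to be explained by chance alone"
        if p_value < alpha
        else "not statistically significant; consistent with normal variation"
    )
    return (
        f"Estimated change: {estimate:+.0%} "
        f"(95% CI: [{ci_low:+.0%}, {ci_high:+.0%}], p = {p_value:.3f}) -- "
        f"{verdict}."
    )

# The 22% example from above, stated with its uncertainty
print(report_finding(0.22, 0.08, 0.36, 0.019))
```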
Charts with uncertainty
When building charts that display statistical results, add error bars or confidence interval bands:
- Bar charts: vertical error bars showing the 95% CI
- Line charts: shaded bands around the line showing the confidence interval
- Always label the confidence level ("95% CI" or "95% confidence interval")
- Add a reference line at zero when showing effect sizes -- an effect whose error bars cross zero is not statistically significant at that confidence level
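A minimal matplotlib sketch of the bar-chart guidance, with made-up effect estimates and interval half-widths (illustration only):

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

# Hypothetical effect estimates with 95% CI half-widths, in percentage points
categories = ["New patients", "Cosmetic", "General"]
effects = [9.5, 6.0, 2.5]
ci_half_widths = [5.5, 4.0, 5.0]  # distance from estimate to CI bound

fig, ax = plt.subplots()
ax.bar(categories, effects, yerr=ci_half_widths, capsize=6)
ax.axhline(0, linewidth=1)  # bars whose error bars cross this line: not significant
ax.set_ylabel("Change in booking share (percentage points)")
ax.set_title("Estimated campaign effects (error bars: 95% CI)")
fig.savefig("campaign_effects.png")
```

In this made-up data, the "General" bar's error bar crosses the zero line, which is the visual counterpart of a non-significant result.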