Statistical Testing Guide
Introduction to Hypothesis Testing
A hypothesis test asks one question: could this pattern be noise?
When Wei says "bookings increased 22%," that is an observation. A hypothesis test asks whether the increase is large enough that chance alone becomes an unlikely explanation. Bookings vary naturally -- some months are higher, some lower, even without any campaign. The test tells you whether the observed increase is bigger than what normal variation would produce.
The structure of a test
Every hypothesis test has two competing explanations:
- Null hypothesis (H0): The campaign had no effect. The increase in bookings is consistent with normal seasonal variation.
- Alternative hypothesis (H1): The campaign did have an effect. The increase in bookings is larger than what seasonal variation alone would produce.
The test computes a p-value -- the probability of seeing an increase at least this large if the null hypothesis were true. If the p-value is small, the data is hard to explain without a campaign effect. If the p-value is large, the data is consistent with normal variation.
P-values are probabilities, not verdicts
A p-value of 0.03 means: if there were no campaign effect, you would see an increase this large about 3% of the time. That is unlikely enough to suggest a real effect -- but it is not proof.
A p-value of 0.08 means: if there were no campaign effect, you would see an increase this large about 8% of the time. That is suggestive but not conclusive.
The conventional threshold is alpha = 0.05 -- if the p-value is below 0.05, the result is called "statistically significant." This threshold is a convention, not a law. A p-value of 0.04 and a p-value of 0.06 represent similar evidence. The threshold helps standardize reporting, not thinking.
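To make the p-value concrete, here is a small simulation with made-up numbers (a monthly baseline of 500 bookings with month-to-month SD of 60, and an observed month of 610 -- all hypothetical). The p-value is simply the share of simulated no-effect months that look at least as extreme as the one observed:

```python
import numpy as np

# Hypothetical numbers for illustration only: baseline bookings average
# 500/month with natural variation (SD 60); the observed month had 610.
rng = np.random.default_rng(42)
baseline_mean, baseline_sd, observed = 500, 60, 610

# Simulate 100,000 "no-effect" months and count how often normal variation
# alone produces a month at least as high as the observed one.
simulated = rng.normal(baseline_mean, baseline_sd, size=100_000)
p_value = (simulated >= observed).mean()
print(f"Simulated one-sided p-value: {p_value:.3f}")
```

The simulated value lands near 0.03: even with no campaign at all, roughly 3 months in 100 would look this good, which is exactly what a p-value of ~0.03 from a formal test is saying.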
What "not significant" means
"Not statistically significant" does NOT mean "no effect." It means the data does not provide strong enough evidence to conclude there was an effect. This could be because:
- There truly was no effect
- The effect was real but small, and the dataset was not large enough to detect it
- The data had quality issues that reduced the test's ability to detect an effect
The honest statement is "insufficient evidence to conclude an effect" -- not "the campaign didn't work."
Test Selection Decision Tree
The choice of test depends on the type of data you are testing.
Is your outcome variable continuous (measurements, revenue amounts)?
- Use a t-test to compare means between two groups
- Example: "Is average revenue per patient higher during the campaign period?"
Is your outcome variable binary or a proportion (yes/no, rates)?
- Use a z-test for proportions to compare rates between two groups
- Example: "Is the proportion of new patients (out of total bookings) higher during the campaign period than during the same period last year?"
- This is the correct test for Wei's primary question -- a booking is either campaign-sourced or not
Are you testing whether two categorical variables are associated?
- Use a chi-squared test to test for association
- Example: "Is the campaign effect associated with specific service categories?"
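The decision tree above can be encoded as a lookup, which is a convenient guardrail in an analysis script (a sketch; `choose_test` is a hypothetical helper, not a library function):

```python
def choose_test(outcome_type: str) -> str:
    """Map an outcome type to a test, following the decision tree above."""
    tests = {
        "continuous": "t-test (compare means between two groups)",
        "proportion": "z-test for proportions (compare rates between two groups)",
        "categorical": "chi-squared test (association between categorical variables)",
    }
    try:
        return tests[outcome_type]
    except KeyError:
        raise ValueError(f"unknown outcome type: {outcome_type!r}")

print(choose_test("proportion"))
```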
Why test selection matters
AI defaults to a t-test whenever it sees "compare two groups" -- regardless of whether the data is continuous, binary, or categorical. Applying a t-test to binary outcome data (proportions) computes the p-value under the wrong model: the gap from the z-test for proportions grows with small samples and extreme proportions, and near the 0.05 threshold it can flip the conclusion from "significant" to "not significant."
For Wei's analysis: the outcome is whether a booking came from a campaign channel or not. That is binary -- a proportion. The z-test for proportions is the correct choice.
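The mismatch is easy to see side by side. The counts below are deliberately tiny, made-up toy numbers (9 of 10 campaign bookings vs. 3 of 10 baseline bookings are new patients) so the two p-values diverge visibly -- with samples this small neither test is ideal in practice, but the point is that the wrong model gives a different answer:

```python
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical toy data: 1 = new patient, 0 = returning patient
campaign = np.array([1] * 9 + [0] * 1)
baseline = np.array([1] * 3 + [0] * 7)

# Wrong tool: a t-test on raw 0/1 outcomes treats them as continuous data
t_stat, t_p = ttest_ind(campaign, baseline)

# Right tool: a z-test for the difference between two proportions
z_stat, z_p = proportions_ztest([9, 3], [10, 10])

print(f"t-test p-value: {t_p:.4f}")
print(f"z-test p-value: {z_p:.4f}")
```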
Running Tests in Python
Z-test for proportions (statsmodels)
```python
from statsmodels.stats.proportion import proportions_ztest

count = [campaign_new_patients, baseline_new_patients]
nobs = [campaign_total_bookings, baseline_total_bookings]

z_stat, p_value = proportions_ztest(count, nobs, alternative='two-sided')
print(f"Z-statistic: {z_stat:.4f}")
print(f"P-value: {p_value:.4f}")
```
Confidence interval for difference in proportions
```python
from statsmodels.stats.proportion import confint_proportions_2indep

ci_low, ci_high = confint_proportions_2indep(
    count1=campaign_new_patients, nobs1=campaign_total_bookings,
    count2=baseline_new_patients, nobs2=baseline_total_bookings,
    method='wald',
)
print(f"95% CI for difference: [{ci_low:.4f}, {ci_high:.4f}]")
```
Chi-squared test (scipy.stats)
```python
from scipy.stats import chi2_contingency

# Rows: period (campaign, baseline); columns: service category
contingency_table = [[campaign_cosmetic, campaign_general],
                     [baseline_cosmetic, baseline_general]]

chi2, p_value, dof, expected = chi2_contingency(contingency_table)
print(f"Chi-squared: {chi2:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Degrees of freedom: {dof}")
```
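The snippets above assume the count variables are already defined. Here is a self-contained run of the z-test plus its confidence interval on hypothetical counts (130 new patients out of 480 campaign bookings vs. 95 out of 450 baseline bookings -- illustration only):

```python
from statsmodels.stats.proportion import (
    confint_proportions_2indep,
    proportions_ztest,
)

# Hypothetical counts for illustration only
campaign_new_patients, campaign_total_bookings = 130, 480
baseline_new_patients, baseline_total_bookings = 95, 450

z_stat, p_value = proportions_ztest(
    [campaign_new_patients, baseline_new_patients],
    [campaign_total_bookings, baseline_total_bookings],
)
ci_low, ci_high = confint_proportions_2indep(
    count1=campaign_new_patients, nobs1=campaign_total_bookings,
    count2=baseline_new_patients, nobs2=baseline_total_bookings,
    method='wald',
)
print(f"p = {p_value:.4f}, 95% CI for difference = [{ci_low:.4f}, {ci_high:.4f}]")
```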
Interpreting Results
Reading the output
- Z-statistic / Chi-squared statistic: How far the observed data falls from what the null hypothesis predicts, measured relative to sampling variation. Larger values mean the observed data is harder to reconcile with the null hypothesis.
- P-value: The probability of seeing data this extreme if the null hypothesis were true. Compare to alpha (0.05 by convention).
- Confidence interval: The range of plausible values for the true effect. "The campaign increased new patient proportion by 3-12 percentage points (95% CI)" means: if you repeated this analysis many times, 95% of the intervals would contain the true effect.
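The "95% of intervals contain the true effect" interpretation can be checked directly by simulation (a sketch with assumed true proportions of 0.27 and 0.21 and a hand-rolled Wald interval -- not tied to any particular library call):

```python
import numpy as np

# Simulate many repeats of the same study and count how often the 95%
# interval for the difference in proportions contains the true difference.
rng = np.random.default_rng(0)
true_p_campaign, true_p_baseline = 0.27, 0.21   # assumed true values
n1, n2, n_repeats = 500, 500, 2000
true_diff = true_p_campaign - true_p_baseline

covered = 0
for _ in range(n_repeats):
    x1 = rng.binomial(n1, true_p_campaign)
    x2 = rng.binomial(n2, true_p_baseline)
    p1, p2 = x1 / n1, x2 / n2
    se = (p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2) ** 0.5
    lo, hi = (p1 - p2) - 1.96 * se, (p1 - p2) + 1.96 * se
    if lo <= true_diff <= hi:
        covered += 1

print(f"Coverage: {covered / n_repeats:.1%}")  # close to 95%
```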
Common interpretation mistakes
- Treating p=0.049 as categorically different from p=0.051 -- they represent nearly identical evidence
- Reporting "the campaign didn't work" when the test is not significant -- the correct statement is "the data does not provide strong evidence of a campaign effect"
- Ignoring the confidence interval and reporting only the point estimate -- "22% increase" is less honest than "between 8% and 36% increase (95% CI)"
- Running multiple tests without acknowledging that more tests means more chances for false positives
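The multiple-testing point in the last bullet has a standard remedy: adjust the p-values for the number of tests. A sketch using statsmodels' `multipletests` with Holm's method, on hypothetical p-values from testing five service categories separately:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from five separate category-level tests
p_values = [0.012, 0.034, 0.049, 0.210, 0.630]

# Holm's method controls the family-wise error rate across all five tests
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method='holm')
for raw, adj, sig in zip(p_values, p_adjusted, reject):
    print(f"raw p = {raw:.3f}  adjusted p = {adj:.3f}  significant: {sig}")
```

Note how three raw p-values sit below 0.05, but after adjustment none survive: with five chances to find a false positive, individually "significant" results carry less weight.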
Reporting with Confidence Intervals
The format
Every statistical finding should include:
- The point estimate (what you measured)
- The confidence interval (the range of plausible values)
- The p-value (how unlikely this result is under the null hypothesis)
- A plain-language interpretation
Examples
Instead of: "New patient bookings increased 22%."
Write: "New patient bookings increased by an estimated 22%, with a 95% confidence interval of [8%, 36%]. This increase is statistically significant (p = 0.019), meaning it is unlikely to be explained by seasonal variation alone."
Instead of: "The campaign had no effect on general dentistry."
Write: "The increase in general dentistry bookings (8%) was not statistically significant (p = 0.34, 95% CI: [-6%, 22%]). The data does not provide sufficient evidence to conclude the campaign affected general dentistry bookings -- the observed increase is consistent with normal seasonal variation."
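The four-part reporting format can be enforced with a small helper so every finding comes out in the same shape (a sketch; `report_finding` is a hypothetical function, not part of any library):

```python
def report_finding(estimate, ci_low, ci_high, p_value, alpha=0.05):
    """Format a finding: point estimate, 95% CI, p-value, plain-language read."""
    verdict = (
        "statistically significant; unlikely to be explained by chance alone"
        if p_value < alpha
        else "not statistically significant; consistent with normal variation"
    )
    return (
        f"Estimated change: {estimate:+.0%} "
        f"(95% CI: [{ci_low:+.0%}, {ci_high:+.0%}], p = {p_value:.3f}) -- "
        f"{verdict}."
    )

# The 22% example from above, stated with its uncertainty
print(report_finding(0.22, 0.08, 0.36, 0.019))
```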
Charts with uncertainty
When building charts that display statistical results, add error bars or confidence interval bands:
- Bar charts: vertical error bars showing the 95% CI
- Line charts: shaded bands around the line showing the confidence interval
- Always label the confidence level ("95% CI" or "95% confidence interval")
- Add a reference line at zero when showing effect sizes -- an effect whose error bars cross zero is not statistically significant at that confidence level
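A minimal matplotlib sketch of the bar-chart guidance, with made-up effect estimates and interval half-widths (illustration only):

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

# Hypothetical effect estimates with 95% CI half-widths, in percentage points
categories = ["New patients", "Cosmetic", "General"]
effects = [9.5, 6.0, 2.5]
ci_half_widths = [5.5, 4.0, 5.0]  # distance from estimate to CI bound

fig, ax = plt.subplots()
ax.bar(categories, effects, yerr=ci_half_widths, capsize=6)
ax.axhline(0, linewidth=1)  # bars whose error bars cross this line: not significant
ax.set_ylabel("Change in booking share (percentage points)")
ax.set_title("Estimated campaign effects (error bars: 95% CI)")
fig.savefig("campaign_effects.png")
```

In this made-up data, the "General" bar's error bar crosses the zero line, which is the visual counterpart of a non-significant result.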