Learn by Directing AI
Unit 3

Are the property differences real?

Step 1: Frame the analytical question

Somchai's board compares raw numbers. Koh Samui's average satisfaction score is 4.3. Chiang Mai is 4.2. Bangkok is 3.9. The board looks at these and draws conclusions about which properties are performing well.

The analytical question: are these differences real, or could they be random variation? If you pulled a different sample of guests, would the ranking change?

This is an inferential question. P1 asked "what are the patterns?" (descriptive). P2 asked "who will not show up?" (predictive). This project asks "are these differences real, and how large are they?" Each question type requires different methods, different evaluation, and different communication.

Step 2: Check assumptions before running the test

The standard approach for comparing means across multiple groups is ANOVA. But ANOVA carries assumptions: the data within each group should be roughly normally distributed, the variance should be similar across groups (homoscedasticity), and the observations should be independent.

Direct AI to check these assumptions before running any test. For normality: run a Shapiro-Wilk test on satisfaction scores for each property. For homoscedasticity: run Levene's test. Check each result.

AI commonly selects statistical tests without checking whether the data meets their requirements. The code runs regardless. The p-value looks like a real answer. But if the assumptions are violated, the p-value is unreliable. Checking assumptions before the test -- not after -- is what makes the method appropriate for the data.
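The assumption checks can be sketched in a few lines of SciPy. This is a minimal sketch, not Somchai's actual pipeline: the property names and the generated scores are illustrative stand-ins for the real dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Illustrative satisfaction scores per property (real data would come
# from Somchai's guest records, not a random generator)
scores = {
    "Koh Samui": rng.normal(4.3, 0.5, 200),
    "Chiang Mai": rng.normal(4.2, 0.5, 200),
    "Bangkok": rng.normal(3.9, 0.7, 200),
}

# Normality: Shapiro-Wilk per group (null hypothesis: the sample is normal)
for name, vals in scores.items():
    stat, p = stats.shapiro(vals)
    verdict = "looks normal" if p > 0.05 else "normality rejected"
    print(f"{name}: Shapiro-Wilk p = {p:.3f} ({verdict})")

# Homoscedasticity: Levene's test across all groups (null: equal variances)
stat, p = stats.levene(*scores.values())
verdict = "equal variances plausible" if p > 0.05 else "unequal variances"
print(f"Levene's test p = {p:.3f} ({verdict})")
```

The point is the order of operations: these checks run and get read before any ANOVA code exists.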

Step 3: Choose the appropriate test

If the assumptions hold, run a one-way ANOVA. If they do not -- and with real-world data, at least one often fails -- consider the non-parametric alternative: a Kruskal-Wallis test. It answers the same question ("is there a significant difference somewhere among the groups?") without requiring normal distributions.

When a normality test fails, the response is not to transform the data until it passes. The response is to use a method that does not require normality. AI often defaults to log-transforming data to force normality rather than switching methods, because transformation preserves its existing code structure. The right approach is to change the method.

Direct AI to run the appropriate test based on what the assumption checks told you.
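The branch looks like this in SciPy. The generated group data is an assumption for illustration; in practice only one of the two tests would run, chosen by the Step 2 results.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Illustrative per-property samples; means echo the board's numbers
groups = [rng.normal(m, 0.5, 150) for m in (4.3, 4.2, 3.9)]

# If the assumption checks passed: one-way ANOVA
f_stat, p_anova = stats.f_oneway(*groups)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_anova:.4f}")

# If normality or equal variances failed: Kruskal-Wallis,
# which compares rank distributions instead of raw means
h_stat, p_kw = stats.kruskal(*groups)
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_kw:.4f}")
```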

Step 4: Effect sizes alongside p-values

The p-value tells you whether the difference is statistically significant. It does not tell you whether the difference is large enough to matter.

A statistically significant difference of 0.1 points on a 5-point scale is real but probably irrelevant for Somchai's board decisions. An effect size tells you how big the difference actually is.

Direct AI to compute the appropriate effect size: eta-squared for ANOVA, or epsilon-squared for Kruskal-Wallis. Both express the proportion of variation in satisfaction that is explained by which property the guest stayed at.

If eta-squared is 0.04, that means the property explains about 4% of the variation in satisfaction. The other 96% comes from other factors -- the season, the room type, individual guest preferences. That is a real but small effect, and Somchai should know that.

Step 5: Pairwise comparisons with correction

The omnibus test (ANOVA or Kruskal-Wallis) tells you "somewhere there is a difference." It does not tell you between which properties.

Direct AI to run pairwise comparisons: Tukey HSD if you used ANOVA, Dunn's test if you used Kruskal-Wallis. These compare each pair of properties.

There is a catch. With five properties, you are running 10 pairwise comparisons. Each comparison has a chance of producing a false positive. Running 10 of them without correction dramatically increases the chance that at least one "significant" result is actually noise: at the usual 0.05 threshold, the probability of at least one false positive across 10 independent comparisons is roughly 1 - 0.95^10, about 40%.

Multiple comparison correction (Bonferroni, Tukey, or Holm) adjusts for this. AI will happily run all 10 comparisons and report every significant result without correction. Direct it to apply the correction.
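Both paths can be sketched briefly. The property names and generated scores are illustrative assumptions; the Kruskal-Wallis follow-up shown here uses pairwise Mann-Whitney tests with a Bonferroni adjustment as a simple stand-in for Dunn's test, which is the standard choice but lives outside SciPy.

```python
from itertools import combinations
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(2)
names = ["Koh Samui", "Chiang Mai", "Bangkok", "Phuket", "Krabi"]  # illustrative
groups = [rng.normal(m, 0.6, 120) for m in (4.3, 4.2, 3.9, 4.1, 4.0)]

# ANOVA path: Tukey HSD controls the family-wise error rate by design
values = np.concatenate(groups)
labels = np.repeat(names, 120)
print(pairwise_tukeyhsd(values, labels))

# Kruskal-Wallis path: pairwise Mann-Whitney with a Bonferroni correction
pairs = list(combinations(range(len(groups)), 2))
alpha_adjusted = 0.05 / len(pairs)  # 10 comparisons across five properties
for i, j in pairs:
    _, p = stats.mannwhitneyu(groups[i], groups[j])
    flag = "significant" if p < alpha_adjusted else "not significant"
    print(f"{names[i]} vs {names[j]}: p = {p:.4f} ({flag} after correction)")
```

Note that Tukey HSD builds the correction in; the Bonferroni division is only needed when the pairwise tests are run individually.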

Step 6: Explore interaction effects

The property differences might depend on something else. Does the satisfaction gap between Koh Samui and Chiang Mai change across seasons? Beach resorts and cultural properties peak at different times of year. The apparent difference might be a seasonal pattern, not a property quality pattern.

Direct AI to test at least one interaction effect. If the interaction is significant, the main effect (property alone) tells an incomplete story. The board should know that Koh Samui leads in the winter months but Chiang Mai catches up in its peak season -- not just that Koh Samui has a higher average.

AI typically tests main effects without exploring interactions unless explicitly directed. The most actionable findings are often in the interactions.
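A two-way model with an interaction term is one way to run the test above. A minimal sketch using statsmodels; the data is fabricated, and the seasonal bump for Koh Samui is an assumption planted purely so the interaction is visible in the output.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(3)
# Fabricated guest records: Koh Samui gets an artificial winter bump
# so that property and season interact
rows = []
for prop, base in [("Koh Samui", 4.3), ("Chiang Mai", 4.2)]:
    for season in ["winter", "summer"]:
        bump = 0.3 if (prop == "Koh Samui" and season == "winter") else 0.0
        for score in rng.normal(base + bump, 0.5, 100):
            rows.append({"property": prop, "season": season, "satisfaction": score})
df = pd.DataFrame(rows)

# Two-way ANOVA: the C(property):C(season) row is the interaction effect
model = smf.ols("satisfaction ~ C(property) * C(season)", data=df).fit()
print(anova_lm(model, typ=2))
```

If the interaction row is significant, reporting the property main effect alone would mislead the board.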

Step 7: Interpret for Somchai

Step back from the statistics. Somchai's board needs to know:

  • Which property differences are real and which are noise?
  • How large are the real differences?
  • Do the differences depend on the season or other factors?

Direct AI to draft a plain-language summary of the inferential findings. The summary should not contain p-values, test statistics, or effect size numbers without plain-language translation. "The satisfaction differences across properties are real but small -- the property explains about 4% of the variation" is more useful than "F(4, 3348) = 8.73, p < 0.001, eta-squared = 0.04."

If an assumption was violated and you switched methods, note that in the methodology memo. "The Shapiro-Wilk test rejected normality for Koh Samui scores, so we used Kruskal-Wallis instead of ANOVA. The conclusion is the same." That sentence tells anyone reading the memo: we checked, we found a problem, we handled it.

✓ Check

Check: Assumptions checked before test. Effect sizes computed. Multiple comparison correction applied. Interaction tested.