Learn by Directing AI

Step 1: Frame the prediction question

The inferential analysis told you which differences are real. Now Somchai wants to know what drives guest satisfaction. Is it the property itself? The season? The room type? Something else?

This is a prediction question with an explanatory purpose. You are building a model not to predict individual scores but to identify which factors matter most. The coefficients and feature importances tell Somchai what levers he has.

Step 2: Feature selection as an analytical decision

The joined dataset has many potential features: property name, room type, season (derived from date), length of stay, booking source, rate per night, review platform.

Not all features belong in the model. Direct AI to list the candidates and evaluate each one:

Is it available at the point where Somchai would use the model? (If not, it is leaky.)
Does it provide independent information or is it redundant with another feature?
Could it create a circular prediction? (Using the overall satisfaction score as a feature to predict satisfaction is obviously circular, but subtler forms exist.)

Feature selection is not about finding the combination with the highest score. It is about building a model that answers the right question with features that make sense.
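
The three screening questions above can be written down as an explicit checklist, so that each keep-or-drop decision has a recorded reason. The sketch below is one possible way to do that. The `Candidate` class, the `screen` function, and every flag value are hypothetical illustrations, not conclusions about the real dataset; you would set the flags from your own review of each feature.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    name: str
    known_at_decision_time: bool      # False means the feature is leaky
    derived_from_target: bool         # True means the prediction would be circular
    redundant_with: Optional[str]     # name of a feature carrying the same information


def screen(candidates):
    """Split candidates into kept feature names and (name, reason) rejections."""
    kept, rejected = [], []
    for c in candidates:
        if c.derived_from_target:
            rejected.append((c.name, "circular: derived from the satisfaction score"))
        elif not c.known_at_decision_time:
            rejected.append((c.name, "leaky: not known when the model is used"))
        elif c.redundant_with is not None:
            rejected.append((c.name, f"redundant with {c.redundant_with}"))
        else:
            kept.append(c.name)
    return kept, rejected


# Flag values below are assumptions made for illustration only.
candidates = [
    Candidate("property_name", True, False, None),
    Candidate("room_type", True, False, None),
    Candidate("season", True, False, None),
    Candidate("length_of_stay", True, False, None),
    Candidate("booking_source", True, False, None),
    Candidate("rate_per_night", True, False, "room_type"),
    Candidate("review_platform", False, False, None),
    Candidate("overall_score", False, True, None),
]

kept, rejected = screen(candidates)
print("keep:", kept)
for name, reason in rejected:
    print(f"drop {name}: {reason}")
```

Recording the reason next to each rejected feature keeps the selection auditable: Somchai can disagree with a specific judgment instead of with the model as a whole.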

Step 3: Build a baseline model

Direct AI to fit a simple linear regression with the selected features. Before evaluating, also compute the naive baseline: predicting the mean satisfaction score for everyone.

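Without a baseline, model performance is uninterpretable. An R-squared of 0.35 sounds modest. But R-squared is already measured against the mean: predicting the mean for everyone scores exactly zero on the data the mean was computed from, and zero or slightly below on held-out data. An R-squared of 0.35 therefore means the model explains about a third of the variation the naive baseline cannot, which corresponds to an RMSE roughly 19% lower than the baseline's. The comparison changes if a simpler reference model, such as one using property alone, already reaches 0.30: then the full model is barely an improvement.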
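
A minimal sketch of that comparison, assuming scikit-learn is available. The data here is synthetic (random features with made-up effects), so the numbers it prints say nothing about the hotel data; the point is the pattern of scoring the naive mean predictor and the regression on the same held-out rows.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the joined dataset: 300 stays, 4 numeric features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 7.5 + 0.6 * X[:, 0] - 0.4 * X[:, 1] + rng.normal(scale=0.8, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)


def evaluate(model):
    """Fit on the training rows, return (R^2, RMSE) on the held-out rows."""
    pred = model.fit(X_train, y_train).predict(X_test)
    return r2_score(y_test, pred), mean_squared_error(y_test, pred) ** 0.5


# DummyRegressor(strategy="mean") always predicts the training-set mean.
baseline_r2, baseline_rmse = evaluate(DummyRegressor(strategy="mean"))
model_r2, model_rmse = evaluate(LinearRegression())

print(f"mean baseline: R^2 = {baseline_r2:.3f}, RMSE = {baseline_rmse:.3f}")
print(f"linear model:  R^2 = {model_r2:.3f}, RMSE = {model_rmse:.3f}")
```

Reporting RMSE next to R-squared puts the gain in the units of the satisfaction score, which is usually easier to explain than a proportion of variance.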

Step 4: Introduce regularization

The dataset may have more features than the data can support. When a model has many features relative to the number of observations, it risks overfitting: memorizing the training data instead of learning real patterns. A large gap between training performance and test performance is the overfitting signal.
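
To see what that gap looks like, here is a small sketch on synthetic data with deliberately too many features (40 features, 60 training rows, only three of which carry signal). The sizes and effect values are assumptions chosen to make the effect visible, not properties of the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: 80 stays, 40 features, only the first three matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 40))
y = 7.5 + 0.8 * X[:, 0] - 0.5 * X[:, 1] + 0.3 * X[:, 2] + rng.normal(size=80)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
train_r2 = model.score(X_train, y_train)   # R^2 on rows the model has seen
test_r2 = model.score(X_test, y_test)      # R^2 on rows it has not seen
gap = train_r2 - test_r2

print(f"train R^2 = {train_r2:.2f}, test R^2 = {test_r2:.2f}, gap = {gap:.2f}")
```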

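Regularization addresses this by penalizing large coefficients. Ridge regression shrinks all coefficients toward zero, keeping all features but reducing their influence. Lasso regression goes further: it can set some coefficients to exactly zero, effectively removing features from the model. Both penalties depend on the scale of the features, so standardize numeric features before fitting.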

The choice between Ridge and Lasso is a decision about what you believe: if most features contribute a little, Ridge is appropriate. If you believe only a few features really matter, Lasso's ability to eliminate features is more useful.

Direct AI to fit both Ridge and Lasso models alongside the baseline regression.
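
One way this step might look in scikit-learn is sketched below. The data is synthetic, and the penalty strengths (`alpha=10.0` for Ridge, `alpha=0.1` for Lasso) are arbitrary placeholders; in practice they would be tuned, for example with cross-validation. Each model is wrapped in a pipeline that standardizes the features first, because the penalty treats all coefficients on the same scale.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 features, only the first three carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 7.5 + 0.8 * X[:, 0] - 0.5 * X[:, 1] + 0.3 * X[:, 2] + rng.normal(size=200)

models = {
    "linear": make_pipeline(StandardScaler(), LinearRegression()),
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=10.0)),   # alpha is a placeholder
    "lasso": make_pipeline(StandardScaler(), Lasso(alpha=0.1)),    # alpha is a placeholder
}

coefs = {}
for name, model in models.items():
    model.fit(X, y)
    coefs[name] = model[-1].coef_   # coefficients of the final pipeline step

for name, c in coefs.items():
    n_zero = int(np.sum(c == 0))
    print(f"{name:>6}: coefficient norm = {np.linalg.norm(c):.3f}, exact zeros = {n_zero}")
```

On data like this you would expect Ridge to keep every coefficient nonzero but smaller than the unregularized fit, and Lasso to zero out some of the features that carry no signal.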

Step 5: Cross-validation

A single train/test split gives one estimate of performance. Cross-validation gives a distribution.

Direct AI to run 5-fold cross-validation on each model (baseline regression, Ridge, Lasso), using the same folds for all of them so the scores are comparable. Report the mean and standard deviation of scores across folds.

The standard deviation matters. If performance varies wildly across folds, the model is unstable: its results depend heavily on which data it happens to see. AI often reports only the mean unless told otherwise. Direct it to include the standard deviation.
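
A sketch of what to ask for, using scikit-learn's `cross_val_score` with a shared `KFold` splitter. The data is synthetic, the `alpha` values are placeholders, and the choice to shuffle with a fixed `random_state` is an assumption made so that every model sees identical folds. The naive mean predictor is included as an extra reference row.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 features, only the first three carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 7.5 + 0.8 * X[:, 0] - 0.5 * X[:, 1] + 0.3 * X[:, 2] + rng.normal(size=200)

models = {
    "mean baseline": DummyRegressor(strategy="mean"),
    "linear": make_pipeline(StandardScaler(), LinearRegression()),
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=10.0)),
    "lasso": make_pipeline(StandardScaler(), Lasso(alpha=0.1)),
}

# One splitter shared by all models, so each is scored on the same five folds.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
    results[name] = (scores.mean(), scores.std())
    print(f"{name:>13}: R^2 = {scores.mean():.3f} +/- {scores.std():.3f}")
```

Putting the scaler inside the pipeline matters here: it is refit on the training portion of each fold, so no information from the held-out fold leaks into the preprocessing.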

Step 6: Compare and select

Look at the cross-validation results:

Which model performs best?
Is the improvement over the baseline meaningful?
How stable is each model (standard deviation across folds)?
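
These three questions can be answered from a small summary table. The sketch below is only an illustration: the scores are placeholder numbers, not results from the hotel data, and the rule that a gain should exceed twice the fold-to-fold standard deviation is a rough rule of thumb assumed here, not a formal test. The `summarize` function is a hypothetical helper.

```python
# Placeholder (mean R^2, std across folds) values for illustration only.
cv_results = {
    "linear": (0.31, 0.09),
    "ridge": (0.33, 0.05),
    "lasso": (0.32, 0.04),
}
baseline_mean = 0.0  # mean R^2 of the naive mean predictor (placeholder)


def summarize(cv_results, baseline_mean):
    """Return one row per model, best mean score first."""
    rows = []
    for name, (mean, std) in cv_results.items():
        gain = mean - baseline_mean
        rows.append({
            "model": name,
            "mean": mean,
            "std": std,
            "gain": gain,
            # Rough heuristic: is the gain clearly larger than fold-to-fold noise?
            "gain_exceeds_noise": gain > 2 * std,
        })
    return sorted(rows, key=lambda row: row["mean"], reverse=True)


rows = summarize(cv_results, baseline_mean)
for row in rows:
    print(
        f"{row['model']:>6}: {row['mean']:.2f} +/- {row['std']:.2f}, "
        f"gain {row['gain']:.2f}, exceeds noise: {row['gain_exceeds_noise']}"
    )
```

When the means are this close, the differences between models sit well inside the fold-to-fold spread, so stability and interpretability are reasonable tie-breakers.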

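If Lasso zeroed out some features, that is informative. It suggests those features add little to satisfaction prediction once the other features are accounted for. One caveat: when two features are strongly correlated, Lasso may keep one and drop the other more or less arbitrarily, so a zeroed coefficient does not prove the feature is irrelevant on its own. With that caveat, the finding of what does not matter is often as useful as what does.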

Step 7: Interpret in business terms

Translate the selected model's feature importances into language Somchai can use.

Direct AI to present the key drivers: which features have the largest coefficients (on standardized features, so that magnitudes are comparable), what direction they point, and what that means for the business. For example: "Season is the strongest driver -- Koh Samui satisfaction peaks in winter months when the beach is at its best. Room type has a moderate effect -- suite and villa guests rate consistently higher. Property name, once you account for season and room type, explains less than you might expect."

Check whether the coefficients make sense in the domain. AI can generate technically valid models with substantively absurd coefficients. If a coefficient suggests that higher prices lead to higher satisfaction, consider whether that reflects a real relationship (premium rooms come with better service) or a confound.
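
The price example can be reproduced on synthetic data. In the sketch below, price is constructed to depend on room type and to have no direct effect on satisfaction, and all effect sizes are made up. The `standardized_coefs` helper is hypothetical. Fitting the model with and without room type shows how a confound can make price look like a driver.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
high_season = rng.integers(0, 2, size=n).astype(float)
suite = (rng.random(n) < 0.3).astype(float)
# By construction, price depends on room type but does not affect satisfaction.
rate = 2500 + 2000 * suite + rng.normal(scale=800, size=n)
y = 7.0 + 0.9 * high_season + 0.5 * suite + rng.normal(scale=0.7, size=n)


def standardized_coefs(columns, names):
    """Fit on standardized features; return (name, coef) sorted by |coef|."""
    X = np.column_stack(columns)
    model = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)
    pairs = zip(names, model[-1].coef_)
    return sorted(pairs, key=lambda pair: abs(pair[1]), reverse=True)


full = standardized_coefs(
    [high_season, suite, rate],
    ["high_season", "suite_or_villa", "rate_per_night"],
)
without_room_type = standardized_coefs(
    [high_season, rate],
    ["high_season", "rate_per_night"],
)

for label, ranking in [("with room type", full), ("without room type", without_room_type)]:
    print(label)
    for name, coef in ranking:
        direction = "raises" if coef > 0 else "lowers"
        print(f"  {name}: {coef:+.3f} ({direction} predicted satisfaction)")
```

When room type is left out, price absorbs its effect and picks up a clearly positive coefficient. A positive price coefficient in the real model is therefore a prompt to ask what price is standing in for, not evidence that raising rates raises satisfaction.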

✓ Check

Check: Cross-validation results reported with standard deviations. Ridge and Lasso compared against the baseline regression. Feature importance stated in business terms. No leaky features in the model.