Step 1: Handle the cancellation data
The booking dataset mixes confirmed and cancelled bookings in a single table. Direct AI to examine the status field and count the breakdown.
About 15% of bookings are cancelled. For the inferential analysis -- testing whether the marketing shift is associated with booking growth -- you need to work with confirmed bookings only. Including cancelled bookings would inflate the total counts and potentially introduce noise if cancellation rates differ by tour type or group size.
Direct AI to filter to confirmed bookings and document the decision: how many records were removed, what percentage, and why. Save this in the methodology memo under "Data Cleaning."
This is a preparation decision, not a trivial filter. If cancellation rates changed around the same time as the marketing shift, excluding them could affect the inference. Direct AI to check whether cancellation rates differ before and after the marketing shift. If they are roughly stable, the filter is clean.
Step 2: Investigate the self-reported marketing channel
The data dictionary flagged that marketing_channel is self-reported. Now dig into what that means for the analysis.
Ask Hassan about how the marketing channel data is collected. Go to the platform and send him a question about it.
Hassan responds quickly: "oh, the booking form asks 'how did you hear about us?' and they pick from a dropdown. It's not exact -- I've had people say Google when I know they clicked our Instagram ad because I can see the referral link. But it's what we have."
This is not a minor data quality issue. Self-reported attribution systematically underestimates the impact of upstream channels like Instagram. Someone sees an Instagram ad, searches on Google, and reports "Google." The marketing_channel field underreports Instagram's actual influence. This means any analysis of channel effectiveness based on this field will understate Instagram and overstate Google.
Document this limitation in the methodology memo under "Attribution Reliability Assessment." Be specific: the limitation constrains what the analysis can say about individual channel effectiveness, but does not prevent the analysis from testing whether the overall shift to digital marketing is associated with booking growth.
Step 3: Run descriptive analysis of seasonal patterns
Direct AI to produce monthly booking trends over the three years, using confirmed bookings only. Ask for a time series plot showing monthly booking counts.
The pattern should be clear: a winter peak from October through April and a summer trough from June through August. This matches Egypt's tourism calendar -- the cooler months attract more visitors. Year-over-year growth is visible across all seasons.
Calculate year-over-year growth rates. Direct AI to compare the same months across years. The growth is not uniform -- some months grow more than others, which is useful information for Hassan's staffing decisions.
These seasonal patterns directly address one of Hassan's five requirements: "show me the seasonal patterns clearly so I can plan my guide staffing." The descriptive analysis is a deliverable, not just a preliminary step.
Step 4: Run descriptive analysis by segment
Direct AI to break down bookings by traveler country, tour type, and group size.
European visitors are the largest segment -- roughly 40% of bookings, with the UK, Germany, and France leading. North American visitors account for about 25%. East Asian visitors about 15%. The rest are spread across other countries.
Look at the tour type breakdown. Private Cultural Tours are the most common. The Luxor Premium packages appear only in the last 12 months of data -- Hassan launched them about a year ago. They carry higher booking values (40,000-80,000 EGP) and have become a meaningful part of the mix.
These segment profiles give Hassan context for his growth story. They also reveal a confounding factor: the Luxor packages launched around the same time as other changes. The regression will need to account for this.
Step 5: Prepare the regression variables
The inferential analysis needs specific variables. Direct AI to create:
- marketing_shift: A binary variable. 0 for bookings before approximately July 2024, 1 for bookings after. This is the key predictor -- it captures the shift from print to digital marketing.
- Seasonal indicators: Month dummies or a peak-season indicator (October through April vs. the rest). These control for the seasonal pattern so the regression does not attribute seasonal variation to the marketing shift.
- Luxor launch indicator: A binary variable for bookings after the Luxor Premium packages launched (approximately January 2025). This controls for the new product line.
The analysis will run at the monthly level -- aggregating confirmed bookings per month -- so direct AI to create a monthly summary dataset with these variables.
Direct AI to check for multicollinearity between these predictors. The marketing_shift and the Luxor launch are correlated in time (both happened during the period), which is a confounding concern. The VIF (variance inflation factor) values will tell you whether this correlation is severe enough to undermine the coefficient estimates.
Step 6: Check statistical assumptions
Before running the regression, direct AI to check the distributions of the dependent variable (monthly bookings) and the residuals from a preliminary model fit.
For an inferential analysis, assumption checking is not optional. The interpretation of coefficients and p-values depends on the assumptions being roughly met:
- Normality of residuals: A Q-Q plot or Shapiro-Wilk test. Moderate departures are tolerable with a sample of 36 months, but extreme skew would require transformation.
- Homoscedasticity: A residuals-versus-fitted-values plot. If the residual spread increases with the fitted values, the standard errors are unreliable and the p-values are misleading.
AI commonly produces assumption check code without interpreting the results. Direct AI to interpret each check explicitly -- not just run the diagnostic, but state whether the assumption is met and what it means for the analysis.
If any assumption fails badly, determine the appropriate adjustment before proceeding to the regression. Document the assumption check results in the methodology memo.
Check: Cancellations separated (~15% removed). Self-reported attribution limitation documented. Seasonal patterns charted -- winter peak visible, year-over-year growth confirmed. Regression variables created. Assumptions checked before the inferential model runs.