Step 1: Review the Preprocessing Tickets
Open materials/tickets.md and find the data preparation tickets (T08-T13). The tasks: analyze distributions before choosing imputation, decide on encoding for each categorical, implement preprocessing, run AI self-review, perform a stratified split, and document your decisions.
In P1, the tickets told you what to do -- "impute missing values, encode categoricals, scale numerics." This time the tickets list the tasks but not the decisions. You examine the data and decide.
Step 2: Analyze Distributions for Imputation
Before choosing how to handle missing values, you need to know what the distributions look like. Direct Claude to show the distribution of each column with missing values: monthly_charges, total_charges, and complaints_count.
Something like: "For each column with missing values in the dataset, show me the distribution -- histogram shape, mean, median, skewness. I need to decide on imputation strategy and the distribution determines what's appropriate."
Mean imputation fills every gap with the average. In a symmetric distribution the average sits at the center, so that is a reasonable stand-in. If the distribution is skewed -- and billing data often is -- the mean is pulled toward the tail and no longer represents a typical value. Imputing with the mean on right-skewed data manufactures values that are higher than most real values. The median is more robust for skewed distributions.
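To make the pull concrete, here is a small sketch in plain Python. The sample values are invented for illustration and do not come from the project dataset.

```python
from statistics import mean, median

# Hypothetical right-skewed sample: most values are low, two are very high.
observed = [20, 22, 25, 25, 28, 30, 32, 35, 120, 200]

mean_fill = mean(observed)      # candidate fill value under mean imputation
median_fill = median(observed)  # candidate fill value under median imputation

# Share of real observations that sit below the mean.
share_below_mean = sum(v < mean_fill for v in observed) / len(observed)

print(f"mean fill:   {mean_fill}")
print(f"median fill: {median_fill}")
print(f"share of real values below the mean: {share_below_mean:.0%}")
```

In this sample the mean lands above most of the real observations, while the median stays in the middle of the bulk.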
Look at what Claude shows you. complaints_count is likely right-skewed (most subscribers have few complaints, some have many). monthly_charges may have moderate skew. Choose your imputation strategy for each based on what the distribution actually looks like.
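If you want to sanity-check the numbers Claude reports, the same summary can be computed by hand. The sketch below uses plain Python; the sample values are hypothetical and summarize is an illustrative helper, not part of the project code. Skewness here is the moment-based coefficient (third central moment divided by the variance raised to the power 1.5). Library implementations may apply a small-sample correction, so their figure can differ slightly.

```python
from statistics import mean, median

def summarize(values):
    """Return missing count, mean, median, and skewness; None marks a missing value."""
    present = [v for v in values if v is not None]
    m = mean(present)
    m2 = sum((v - m) ** 2 for v in present) / len(present)  # variance (population)
    m3 = sum((v - m) ** 3 for v in present) / len(present)  # third central moment
    skew = m3 / m2 ** 1.5 if m2 > 0 else 0.0
    return {
        "missing": len(values) - len(present),
        "mean": m,
        "median": median(present),
        "skew": skew,
    }

# Hypothetical complaints_count sample: mostly zeros, one heavy complainer.
complaints = [0, 0, 0, 1, 0, 2, None, 1, 0, 9, None, 0]
summary = summarize(complaints)
print(summary)
```

A mean well above the median together with a clearly positive skew is the signature of a right-skewed column.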
Step 3: Decide on Encoding
Open materials/data-dictionary-v2.md. For each categorical column, determine whether it's nominal or ordinal.
plan_type (Basic, Standard, Premium) -- is there a natural order? Yes. Basic < Standard < Premium. Ordinal encoding preserves that.
payment_method (Bank Transfer, Credit Card, Electronic Check, Mailed Check) -- is there a natural order? No. These are just categories. One-hot encoding treats them as distinct, which they are.
contract_type (Month-to-month, One year, Two year) -- natural order? Yes. Shorter commitment to longer.
segment (prepaid, postpaid) -- two categories, no order. One-hot or binary encoding both work; a single 0/1 column carries the same information as two one-hot columns.
AI commonly applies the same encoding to all categoricals without checking whether the variable has a natural ordering. If you encode "Basic/Standard/Premium" as three one-hot columns, you've thrown away the ordering information. If you encode "Bank Transfer/Credit Card/Electronic Check/Mailed Check" ordinally, you've invented an ordering that doesn't exist.
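The per-column decisions above can be sketched in plain Python as follows. The category lists come from the text; the helper names, the example row, and the output column names are illustrative assumptions. In the project you would more likely have Claude use a dataframe library, but the logic is the same.

```python
# Category orders taken from the text above; helper names are illustrative.
PLAN_ORDER = ["Basic", "Standard", "Premium"]
CONTRACT_ORDER = ["Month-to-month", "One year", "Two year"]
PAYMENT_METHODS = ["Bank Transfer", "Credit Card", "Electronic Check", "Mailed Check"]

def ordinal_encode(value, order):
    """Map a category to its rank in the given order (0 = lowest)."""
    return order.index(value)  # raises ValueError on an unknown category

def one_hot_encode(column, value, categories):
    """Return one 0/1 flag per category, with no implied order between them."""
    if value not in categories:
        raise ValueError(f"unknown {column}: {value!r}")
    return {f"{column}_{c}": int(value == c) for c in categories}

# Hypothetical subscriber row.
row = {
    "plan_type": "Premium",
    "payment_method": "Credit Card",
    "contract_type": "One year",
    "segment": "postpaid",
}

encoded = {
    "plan_type": ordinal_encode(row["plan_type"], PLAN_ORDER),
    "contract_type": ordinal_encode(row["contract_type"], CONTRACT_ORDER),
    "segment_postpaid": int(row["segment"] == "postpaid"),  # single binary column
}
encoded.update(one_hot_encode("payment_method", row["payment_method"], PAYMENT_METHODS))
print(encoded)
```

The ordinal columns keep a single number whose size means something; the payment method becomes four independent flags, so no method is treated as "greater" than another.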
Step 4: Implement and Self-Review
Direct Claude to implement the preprocessing pipeline using the strategies you chose: your imputation choices, your encoding choices, and scaling for the numeric features.
After Claude produces the code, run AI self-review with a targeted prompt: "List every transformation applied to the data before the train/test split. For each one, confirm it does not use any information from the test set."
This is not "does this look right?" That produces confident reassurance. This is a specific check against a specific failure class: data leakage through preprocessing. The prompt names what to look for, which forces Claude to examine each transformation individually.
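One pattern that passes this check is to estimate every fitted parameter (the imputation value, the scaling mean and standard deviation) from the training rows only, then apply those fixed parameters to the test rows. A minimal sketch in plain Python, with invented values and hypothetical helper names fit_preprocessor and transform:

```python
from statistics import mean, median, pstdev

def fit_preprocessor(train_values):
    """Learn the fill value and scaling parameters from training data only."""
    present = [v for v in train_values if v is not None]
    fill = median(present)
    filled = [fill if v is None else v for v in train_values]
    std = pstdev(filled) or 1.0  # guard against a constant column
    return {"fill": fill, "mean": mean(filled), "std": std}

def transform(values, params):
    """Impute and standardize using parameters that were fitted elsewhere."""
    filled = [params["fill"] if v is None else v for v in values]
    return [(v - params["mean"]) / params["std"] for v in filled]

# Hypothetical monthly_charges values; None marks a missing entry.
train = [20.0, 25.0, None, 30.0, 35.0]
test = [None, 500.0]

params = fit_preprocessor(train)       # the test rows are never seen here
train_scaled = transform(train, params)
test_scaled = transform(test, params)  # reuses the training parameters

print(params)
print(test_scaled)
```

The extreme test value of 500.0 has no effect on the fill value or the scaling, which is exactly what the self-review prompt asks Claude to confirm. In scikit-learn the same idea is usually expressed by calling fit on the training data and transform on the test data.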
Step 5: Stratified Split and Documentation
Direct Claude to perform a stratified train/test split. The train and test sets must each preserve the churn class distribution and keep a reasonable segment distribution. Verify: the churn proportion in each set should be within 1 percentage point of the proportion in the full dataset.
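The verification can be sketched as below. This is a plain-Python illustration under stated assumptions: the rows are synthetic, the field names churn and segment mirror the column names in the text, stratified_split is a hypothetical helper, and stratifying on the combination of churn and segment is one possible way to keep both distributions stable. If some combinations are very small, stratifying on churn alone may be the safer choice.

```python
import random
from collections import defaultdict

def stratified_split(rows, key, test_fraction=0.2, seed=42):
    """Split rows so each stratum (as defined by key) keeps the same test share."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    train, test = [], []
    for members in strata.values():
        members = members[:]  # copy so the caller's list is not reordered
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

def churn_rate(rows):
    return sum(r["churn"] for r in rows) / len(rows)

# Synthetic data: 20% churners, roughly one third prepaid.
rows = [
    {"churn": 1 if i % 5 == 0 else 0,
     "segment": "prepaid" if i % 3 == 0 else "postpaid"}
    for i in range(1000)
]

train, test = stratified_split(rows, key=lambda r: (r["churn"], r["segment"]))

overall = churn_rate(rows)
for name, subset in (("train", train), ("test", test)):
    gap = abs(churn_rate(subset) - overall)
    print(f"{name}: churn rate {churn_rate(subset):.3f}, gap {gap:.3f}")
    assert gap <= 0.01, f"{name} churn proportion drifted by more than 1 point"
```

The assertion turns the 1-percentage-point requirement into something that fails loudly instead of relying on a visual check.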
Now document your preprocessing decisions. Direct Claude to write a summary of each choice with rationale: which encoding for which column and why, which imputation for which column and why. This is not busywork. When someone reviews this model in three months -- or when you come back to improve it -- the rationale is what makes the decisions reviewable rather than opaque.
Check: The preprocessing decisions document exists and lists at least three specific encoding or imputation choices with rationale. The stratified split preserves churn class proportion within 1 percentage point.