Step 1: Review the Preprocessing Tickets
Open materials/tickets.md and find the preprocessing tickets. Check materials/CLAUDE.md if you need a reminder of the project structure. There are four tasks ahead: impute missing values, encode categorical features, scale numerical features, and perform a stratified train/test split.
Read through all four before you start any of them. The order matters. Imputation fills gaps in the raw data. Encoding converts categories into numbers the model can ingest. Scaling normalizes numeric ranges so no single feature dominates by magnitude. Splitting separates the data into training and test sets — and once that split happens, the test set is off-limits until evaluation.
Each of these is a separate task. You'll direct Claude through them one at a time, not all at once. A single prompt that says "preprocess the data" will produce something — but you won't know what choices were made or why. Breaking the work into focused requests means you can review each decision before moving to the next.
Step 2: Handle Missing Values
Check your data profile from Unit 1. Which columns have missing values? How many?
Direct Claude to impute the missing values. Something like: "Impute missing values in subscribers.csv. Show me which columns have missing values, what strategy you're using for each, and why."
Review what comes back. Claude will pick a strategy — probably filling numeric columns with the mean or median, and categorical columns with the mode. The question is whether those choices make sense for this data.
Mean imputation assumes the data is roughly symmetric. If a column is heavily skewed — monthly charges with a long right tail, for example — the mean gets pulled toward the outliers. Median imputation is more robust to skew. For categorical columns, mode imputation fills gaps with the most common value, which is usually reasonable unless the distribution is nearly uniform.
The point is not that one strategy is always right. The point is that imputation is a decision, and each strategy assumes something about the data. When you don't specify what you want, Claude picks defaults from its training — and those defaults may not match the distribution sitting in materials/subscribers.csv.
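The tradeoff can be sketched with a toy DataFrame. The column names here are hypothetical stand-ins; the real distributions live in materials/subscribers.csv:

```python
import pandas as pd
import numpy as np

# Toy stand-in for the real data (hypothetical columns and values)
df = pd.DataFrame({
    "monthly_charges": [20.0, 25.0, np.nan, 200.0],  # right-skewed: one large outlier
    "contract": ["month-to-month", None, "one-year", "month-to-month"],
})

# Median (25.0) is robust to the skew; the mean (~81.7) would be pulled toward 200
df["monthly_charges"] = df["monthly_charges"].fillna(df["monthly_charges"].median())

# Mode fills the categorical gap with the most common value
df["contract"] = df["contract"].fillna(df["contract"].mode()[0])

print(df.isna().sum().sum())  # 0 missing values remain
```

On this toy data the mean and median imputations differ by more than 50, which is exactly the kind of gap worth catching in review.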
Step 3: Encode Categorical Features
Direct Claude to encode the categorical features. Ask it to show you which encoding it chose for each column and why.
Two common approaches: one-hot encoding and ordinal encoding. The difference matters. One-hot encoding creates a new binary column for each category — it treats all categories as equally different from each other. Ordinal encoding assigns integers — it implies an order. "Month-to-month," "one-year," and "two-year" contracts have a natural order: commitment length. Encoding them as 0, 1, 2 makes sense. Payment methods — "credit card," "bank transfer," "electronic check" — don't have a meaningful order. One-hot encoding is the safer choice for nominal categories like these.
If Claude uses one-hot encoding for everything, that works but misses the ordinal structure in contract type. If it uses ordinal encoding for everything, it imposes a fake ordering on nominal categories. Neither is catastrophic for this project, but noticing the difference is part of understanding what the preprocessing actually claims about the data.
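The two encodings side by side, again on hypothetical values (the real column names may differ):

```python
import pandas as pd

df = pd.DataFrame({
    "contract": ["month-to-month", "two-year", "one-year"],
    "payment_method": ["credit card", "bank transfer", "electronic check"],
})

# Ordinal: contract terms have a natural commitment order, so integers preserve it
order = {"month-to-month": 0, "one-year": 1, "two-year": 2}
df["contract"] = df["contract"].map(order)

# One-hot: payment methods are nominal, so each category gets its own binary column
df = pd.get_dummies(df, columns=["payment_method"])

print(df["contract"].tolist())  # [0, 2, 1]
print(len(df.columns))          # 4: contract + three payment_method_* columns
```

Note the asymmetry: mapping payment methods through a dict like `order` would run fine and produce numbers, which is why the fake ordering is easy to miss unless you ask which encoding was used and why.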
Step 4: Scale and Split
Direct Claude to scale the numerical features and perform a stratified train/test split. Be specific: "Scale numerical features with StandardScaler. Split into 80/20 train/test with stratification on the churn column. Use random_state=42."
Why stratification? Remember the class imbalance from Unit 1 — only about 8% of subscribers churned. A random split could easily put 6% in the training set and 12% in the test set, or the reverse. That makes the training data unrepresentative and the test results unreliable. Stratified splitting forces both sets to preserve the original class distribution.
Why random_state=42? Without it, every run produces a different split. Your evaluation numbers change each time — not because the model improved, but because the test set changed. Setting a random state makes the split reproducible. Anyone who runs the same code gets the same split and the same results.
Why scale? Features like tenure_months (range 1-72) and total_charges (range in the thousands) live on very different scales. Some algorithms are sensitive to this — a feature with larger numbers can dominate simply because its values are bigger. StandardScaler centers each feature at zero with unit variance, putting them on comparable footing.
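A minimal sketch of the scale-and-split step with scikit-learn, using synthetic data in place of the real features. One common arrangement, shown here, is to split first and fit the scaler on the training rows only, so the test set's statistics don't leak into the transformation:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 1000 rows, 3 numeric features, ~8% positive churn labels
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 0.08).astype(int)

# 80/20 split, stratified on the label, reproducible via random_state
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on training data only, then apply the same transform to both sets
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print(X_train.shape, X_test.shape)  # (800, 3) (200, 3)
print(round(y_train.mean(), 3), round(y_test.mean(), 3))
```

The two printed proportions should land within a fraction of a percentage point of each other; that closeness is the stratification doing its job.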
Step 5: Verify the Results
Before moving on, check three things.
First, the shapes. Direct Claude to print the shapes of the training and test sets. The rows should add up to the original dataset size. The columns should match — same features in both sets.
Second, the class distribution. Direct Claude to print the churn class proportion in both the training set and the test set. Compare them to the original proportion. If stratification worked, all three should be nearly identical — within a percentage point.
Third, no data was lost or duplicated. The total rows across train and test should equal the original row count. No subscriber should appear in both sets.
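All three checks can be expressed as assertions. This sketch reuses synthetic data (the real check would run against the actual train/test arrays); passing an index array through the split is one way to test for row overlap:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed data, ~8% positive class
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 0.08).astype(int)
idx = np.arange(len(X))  # track row identity so overlap is detectable

X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    X, y, idx, test_size=0.2, stratify=y, random_state=42
)

# 1. Shapes: rows add up to the original count, columns match across sets
assert len(X_train) + len(X_test) == len(X)
assert X_train.shape[1] == X_test.shape[1]

# 2. Class proportions: both sets within a percentage point of the original
for part in (y_train, y_test):
    assert abs(part.mean() - y.mean()) < 0.01

# 3. No row appears in both sets
assert len(set(idx_train) & set(idx_test)) == 0

print("all checks passed")
```

If any assertion fires, something upstream silently dropped, duplicated, or rebalanced rows, and the place to look is the transformation step just before the split.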
This verification step feels mechanical, but it catches real problems. A preprocessing pipeline is a sequence of transformations, and each step can silently drop rows, duplicate records, or scramble the class balance. Checking the output shapes and distributions against what you started with is how you confirm the pipeline did what you intended — not just that it ran without errors.
✓ Check: The churn class proportion in both train and test sets is within 1 percentage point of the original dataset proportion (~8%).