ML P2: Churn Prediction Phase 2 -- Tunde Mobile
Client
Emeka Okafor, Head of Customer Retention at Tunde Mobile (Lagos, Nigeria). Returning client from P1. His team runs the P1 churn model weekly, but it misses churn among prepaid customers. His board wants the methodology documented.
What you're building
An improved churn prediction model that handles the prepaid/postpaid segment gap, with a full artifact creation pipeline: PRD, evaluation design, documented preprocessing decisions, experiment tracking in MLflow, per-segment evaluation, API serving, and board-facing documentation.
Tech stack
- Python 3.11+ (conda ml environment)
- pandas (data loading, profiling, preprocessing)
- scikit-learn (preprocessing, training, evaluation)
- MLflow (experiment tracking)
- FastAPI + uvicorn (model serving)
- Jupyter (notebook workflow)
- Git/GitHub (version control)
File structure
ml/p2/
  materials/
    CLAUDE.md (this file)
    emeka-followup.md (client email)
    prd-template.md (PRD template)
    subscribers-v2.csv (dataset: ~9,000 rows)
    data-dictionary-v2.md (column reference)
    tickets.md (ticket breakdown)
  notebooks/ (Jupyter notebooks -- student creates)
  docs/ (PRD, eval design, preprocessing decisions, eval results)
  src/ (model code, serving endpoint)
Tickets
- T01: Project setup -- download materials, read CLAUDE.md, profile dataset
- T02: Client discovery -- ask Emeka about prepaid behavior and board needs
- T03: Data profiling -- compare P2 dataset against P1 baseline
- T04: PRD creation -- draft requirements document using template
- T05: Evaluation design -- choose metrics, define per-segment evaluation, set baselines
- T06: Baseline computation -- majority-class and logistic regression baselines
- T07: Per-segment evaluation plan -- define separate metrics for prepaid and postpaid
- T08: Missing value analysis -- examine distributions before choosing imputation
- T09: Encoding decisions -- determine nominal vs ordinal for each categorical
- T10: Implement preprocessing -- imputation, encoding, scaling
- T11: AI self-review -- prompt Claude to verify preprocessing pipeline
- T12: Stratified split -- preserve class and segment distributions
- T13: Document preprocessing decisions -- rationale for each choice
- T14: MLflow setup -- configure experiment tracking
- T15: Train baseline logistic regression -- log to MLflow
- T16: Train RandomForest -- log to MLflow
- T17: Hyperparameter tuning -- cross-validation with fold variance check
- T18: Per-segment evaluation -- compute metrics for prepaid and postpaid separately
- T19: Experiment comparison -- use MLflow to compare runs
- T20: Serve best model -- FastAPI endpoint
- T21: Evaluation documentation -- board-facing results summary
- T22: Update PRD -- compare actual results against planned criteria
- T23: Client documentation review -- send to Emeka for board readiness feedback
- T24: Write README
- T25: Final commit and project close
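The per-segment evaluation in T18 can be sketched as below. This is a minimal illustration under stated assumptions, not the project's implementation: the function name `recall_by_segment`, the segment labels "prepaid" and "postpaid", and the toy data are all hypothetical, and the real column names should come from data-dictionary-v2.md. It uses plain Python so the arithmetic is visible; in the notebook, scikit-learn's `recall_score` applied to each segment's rows should give the same numbers.

```python
from collections import defaultdict


def recall_by_segment(y_true, y_pred, segments):
    """Return recall = TP / (TP + FN) for each segment.

    A segment with no actual churners gets None, because recall
    is undefined when there are no positives to find.
    """
    true_pos = defaultdict(int)
    false_neg = defaultdict(int)
    for truth, pred, seg in zip(y_true, y_pred, segments):
        if truth == 1:  # only actual churners count toward recall
            if pred == 1:
                true_pos[seg] += 1
            else:
                false_neg[seg] += 1

    result = {}
    for seg in sorted(set(segments)):
        positives = true_pos[seg] + false_neg[seg]
        result[seg] = true_pos[seg] / positives if positives else None
    return result


# Toy data for illustration only (1 = churned, 0 = stayed).
y_true = [1, 1, 0, 1, 1, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 0, 1, 1]
segments = ["prepaid"] * 4 + ["postpaid"] * 4

per_segment = recall_by_segment(y_true, y_pred, segments)
print(per_segment)
```

In this toy data the model finds every postpaid churner but only one of three prepaid churners, which is the kind of gap a single overall recall figure would hide.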
Verification targets
- PRD includes prepaid gap, evaluation metrics with rationale, board-facing success criteria
- Evaluation design has per-segment metrics and baseline scores
- Preprocessing decisions document has at least three choices with rationale
- Stratified split preserves churn class proportion within 1 percentage point
- MLflow has at least two experiment runs with logged parameters and metrics
- Per-segment evaluation shows prepaid recall separately from postpaid recall
- API endpoint returns JSON with churn probability for valid requests
- Evaluation documentation includes per-segment results and baseline comparisons
- Repository contains all pipeline artifacts, committed
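The stratified-split target above could be checked with something like the following sketch. It assumes the 1 percentage point tolerance is measured against the churn rate of the full dataset, which the target does not state explicitly; the helper names `churn_rate` and `split_preserves_rate` and the toy label lists are hypothetical.

```python
def churn_rate(labels):
    """Fraction of rows labelled 1 (churned)."""
    return sum(labels) / len(labels)


def split_preserves_rate(full, train, test, tolerance_pp=1.0):
    """True if the train and test churn rates are each within
    `tolerance_pp` percentage points of the full dataset's rate."""
    base_pct = churn_rate(full) * 100
    return all(
        abs(churn_rate(part) * 100 - base_pct) <= tolerance_pp
        for part in (train, test)
    )


# Toy labels for illustration only: 20% churn overall.
full = [1] * 20 + [0] * 80

# A split that keeps 20% churn in both parts.
good_train = [1] * 16 + [0] * 64
good_test = [1] * 4 + [0] * 16

# A skewed split: 25% churn in train, 0% in test.
bad_train = [1] * 20 + [0] * 60
bad_test = [0] * 20

good_ok = split_preserves_rate(full, good_train, good_test)
bad_ok = split_preserves_rate(full, bad_train, bad_test)
print(good_ok, bad_ok)
```

The same comparison can be repeated on the segment column to confirm that the prepaid/postpaid mix also survives the split, as T12 asks.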
Commit convention
Commit after completing each ticket or logical group of tickets. Use descriptive messages: "Add PRD with evaluation criteria", "Implement preprocessing with documented decisions", "Train models with MLflow tracking".