ML P3: Infrastructure Foundation
Project
Client: Emeka Okafor, Head of Customer Retention, Tunde Mobile (Lagos, Nigeria)
What you're building: Reliability infrastructure for the churn prediction API from P2. This covers input validation, health checks, model versioning, structured error responses, experiment tracking as infrastructure, and reproducibility guarantees.
Tech Stack
- Python 3.11+
- FastAPI + uvicorn (serving)
- Pydantic (input validation)
- MLflow (experiment tracking)
- scikit-learn (model -- already trained, you're adding infrastructure around it)
- joblib (model serialization)
- Git/GitHub (version control)
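T-12 asks for every dependency in requirements.txt to be pinned. A sketch of what the fixed file could look like; the exact version numbers below are illustrative, not prescribed, and should be replaced with whatever `pip freeze` reports in the working environment:

```
fastapi==0.110.0
uvicorn==0.29.0
pydantic==2.6.4
mlflow==2.11.3
scikit-learn==1.4.1.post1
joblib==1.3.2
```

Pinning exact versions (rather than `>=` ranges) is what makes the reproducibility check in T-13 meaningful: two installs from the same file resolve to identical library behavior.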
File Structure
materials/
  CLAUDE.md             -- this file
  emeka-reliability.md  -- Emeka's email about the API outage
  tickets.md            -- work breakdown (13 tickets in 4 groups)
  api-baseline/         -- starting API code from P2
    app.py              -- basic FastAPI endpoint (no validation, no health check)
    model.pkl           -- trained churn model
    requirements.txt    -- unpinned dependencies (to be fixed)
    data_profile.json   -- training data ranges for validation boundaries
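The validation work (T-01) hinges on turning the ranges in data_profile.json into Pydantic field constraints. A minimal sketch, assuming two hypothetical features; the real field names and bounds must come from data_profile.json, not be hardcoded like this:

```python
from pydantic import BaseModel, Field

class ChurnFeatures(BaseModel):
    # Hypothetical fields -- actual names and ranges come from data_profile.json.
    tenure_months: int = Field(ge=0, le=120)        # bounded by training data range
    monthly_spend: float = Field(ge=0.0, le=500.0)  # same idea for a float feature
```

When a request violates a constraint, FastAPI rejects it with a 422 and a structured JSON body naming the offending field and the limit it broke, which lines up with T-02's requirement that clients see JSON errors rather than stack traces.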
Tickets
- T-01: Add Pydantic input validation with training data ranges
- T-02: Add structured error responses (JSON, not stack traces)
- T-03: Test validation with valid, boundary, and invalid inputs
- T-04: Add health check endpoint that verifies model is loaded
- T-05: Add model versioning to prediction responses
- T-06: Test health check by simulating model absence
- T-07: Set up MLflow experiment with structured logging
- T-08: Verify MLflow logs actual variable values (not hardcoded)
- T-09: Run controlled experiment with two model variants
- T-10: Compare experiments in MLflow UI
- T-11: Add random seeds for all random operations
- T-12: Pin all library versions in requirements.txt
- T-13: Verify reproducibility (same config twice gives identical results)
Verification Targets
- Out-of-range input returns 422 with structured JSON error naming the field and valid range
- Valid input returns 200 with prediction, probability, and model_version
- /health endpoint reports model load status; returns unhealthy when model file is removed
- MLflow shows experiment runs with parameters and metrics (no hardcoded values)
- Same configuration run twice produces identical metric values
- requirements.txt has pinned versions for all dependencies
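The "same configuration twice gives identical results" target (T-11/T-13) usually comes down to seeding every source of randomness in one place. A sketch, assuming the standard Python and NumPy generators are the only ones in play:

```python
import os
import random

import numpy as np

def set_seeds(seed: int = 42) -> None:
    """Seed all random sources so two runs with the same config match exactly."""
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)  # affects hash randomization in child processes

# scikit-learn estimators take their own seed as well, e.g.
# RandomForestClassifier(random_state=seed) -- global seeding alone is not enough.
```

Calling this once at the top of each experiment run, with the seed logged as an MLflow parameter, makes the reproducibility check a direct comparison of metric values between two runs.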
Commit Convention
Commit after completing each ticket group (validation, health/versioning, experiment tracking, reproducibility). Use clear commit messages describing what was added and why.