ML P3: Infrastructure Foundation
Project
Client: Emeka Okafor, Head of Customer Retention, Tunde Mobile (Lagos, Nigeria)
What you're building: Reliability infrastructure for the churn prediction API from P2. This covers input validation, health checks, model versioning, structured error responses, experiment tracking as infrastructure, and reproducibility guarantees.
Tech Stack
- Python 3.11+
- FastAPI + uvicorn (serving)
- Pydantic (input validation)
- MLflow (experiment tracking)
- scikit-learn (model -- already trained, you're adding infrastructure around it)
- joblib (model serialization)
- Git/GitHub (version control)
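T-12 asks for every dependency in requirements.txt to be pinned. A sketch of what the fixed file could look like; the exact version numbers below are illustrative, not prescribed, and should be replaced with whatever `pip freeze` reports in the working environment:

```
fastapi==0.110.0
uvicorn==0.29.0
pydantic==2.6.4
mlflow==2.11.3
scikit-learn==1.4.1.post1
joblib==1.3.2
```

Pinning exact versions (rather than `>=` ranges) is what makes the reproducibility check in T-13 meaningful: two installs from the same file resolve to identical library behavior.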
File Structure
materials/
  CLAUDE.md             -- this file
  emeka-reliability.md  -- Emeka's email about the API outage
  tickets.md            -- work breakdown (13 tickets in 4 groups)
  api-baseline/         -- starting API code from P2
    app.py              -- basic FastAPI endpoint (no validation, no health check)
    model.pkl           -- trained churn model
    requirements.txt    -- unpinned dependencies (to be fixed)
    data_profile.json   -- training data ranges for validation boundaries
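The validation work (T-01) hinges on turning the ranges in data_profile.json into Pydantic field constraints. A minimal sketch, assuming two hypothetical features; the real field names and bounds must come from data_profile.json, not be hardcoded like this:

```python
from pydantic import BaseModel, Field

class ChurnFeatures(BaseModel):
    # Hypothetical fields -- actual names and ranges come from data_profile.json.
    tenure_months: int = Field(ge=0, le=120)        # bounded by training data range
    monthly_spend: float = Field(ge=0.0, le=500.0)  # same idea for a float feature
```

When a request violates a constraint, FastAPI rejects it with a 422 and a structured JSON body naming the offending field and the limit it broke, which lines up with T-02's requirement that clients see JSON errors rather than stack traces.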
Tickets
- T-01: Add Pydantic input validation with training data ranges
- T-02: Add structured error responses (JSON, not stack traces)
- T-03: Test validation with valid, boundary, and invalid inputs
- T-04: Add health check endpoint that verifies model is loaded
- T-05: Add model versioning to prediction responses
- T-06: Test health check by simulating model absence
- T-07: Set up MLflow experiment with structured logging
- T-08: Verify MLflow logs actual variable values (not hardcoded)
- T-09: Run controlled experiment with two model variants
- T-10: Compare experiments in MLflow UI
- T-11: Add random seeds for all random operations
- T-12: Pin all library versions in requirements.txt
- T-13: Verify reproducibility (same config twice gives identical results)
Verification Targets
- Out-of-range input returns 422 with structured JSON error naming the field and valid range
- Valid input returns 200 with prediction, probability, and model_version
- /health endpoint reports model load status; returns unhealthy when model file is removed
- MLflow shows experiment runs with parameters and metrics (no hardcoded values)
- Same configuration run twice produces identical metric values
- requirements.txt has pinned versions for all dependencies
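The "same configuration twice gives identical results" target (T-11/T-13) usually comes down to seeding every source of randomness in one place. A sketch, assuming the standard Python and NumPy generators are the only ones in play:

```python
import os
import random

import numpy as np

def set_seeds(seed: int = 42) -> None:
    """Seed all random sources so two runs with the same config match exactly."""
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)  # affects hash randomization in child processes

# scikit-learn estimators take their own seed as well, e.g.
# RandomForestClassifier(random_state=seed) -- global seeding alone is not enough.
```

Calling this once at the top of each experiment run, with the seed logged as an MLflow parameter, makes the reproducibility check a direct comparison of metric values between two runs.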
Commit Convention
Commit after completing each ticket group (validation, health/versioning, experiment tracking, reproducibility). Use clear commit messages describing what was added and why.