Learn by Directing AI

CLAUDE.md

ML P3: Infrastructure Foundation

Project

Client: Emeka Okafor, Head of Customer Retention, Tunde Mobile (Lagos, Nigeria)

What you're building: Reliability infrastructure for the churn prediction API from P2. Input validation, health checks, model versioning, structured error responses, experiment tracking as infrastructure, and reproducibility guarantees.

Tech Stack

  • Python 3.11+
  • FastAPI + uvicorn (serving)
  • Pydantic (input validation)
  • MLflow (experiment tracking)
  • scikit-learn (model -- already trained, you're adding infrastructure around it)
  • joblib (model serialization)
  • Git/GitHub (version control)

File Structure

materials/
  CLAUDE.md          -- this file
  emeka-reliability.md -- Emeka's email about the API outage
  tickets.md         -- work breakdown (13 tickets in 4 groups)
  api-baseline/      -- starting API code from P2
    app.py           -- basic FastAPI endpoint (no validation, no health check)
    model.pkl        -- trained churn model
    requirements.txt -- unpinned dependencies (to be fixed)
    data_profile.json -- training data ranges for validation boundaries

Tickets

  • T-01: Add Pydantic input validation with training data ranges
  • T-02: Add structured error responses (JSON, not stack traces)
  • T-03: Test validation with valid, boundary, and invalid inputs
  • T-04: Add health check endpoint that verifies model is loaded
  • T-05: Add model versioning to prediction responses
  • T-06: Test health check by simulating model absence
  • T-07: Set up MLflow experiment with structured logging
  • T-08: Verify MLflow logs actual variable values (not hardcoded)
  • T-09: Run controlled experiment with two model variants
  • T-10: Compare experiments in MLflow UI
  • T-11: Add random seeds for all random operations
  • T-12: Pin all library versions in requirements.txt
  • T-13: Verify reproducibility (same config twice gives identical results)
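As a starting point for the validation group (T-01/T-02), the shape of a range-constrained Pydantic model and a structured error payload can be sketched as below. The field names and bounds are hypothetical placeholders; the real ranges come from data_profile.json.

```python
from pydantic import BaseModel, Field, ValidationError

class ChurnFeatures(BaseModel):
    # Bounds here are illustrative; in practice load them from data_profile.json
    tenure_months: int = Field(ge=0, le=120)
    monthly_spend: float = Field(ge=0.0, le=50000.0)

def format_validation_error(exc: ValidationError) -> dict:
    """Shape Pydantic errors into the structured JSON body for T-02.

    In the API this would run inside a FastAPI RequestValidationError
    handler and be returned as a JSONResponse with status_code=422,
    naming each offending field instead of dumping a stack trace.
    """
    return {
        "error": "validation_failed",
        "details": [
            {"field": ".".join(str(p) for p in e["loc"]),
             "message": e["msg"]}
            for e in exc.errors()
        ],
    }
```

Keeping the error-shaping logic in a plain function makes T-03's boundary tests easy to write without spinning up the server.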

Verification Targets

  • Out-of-range input returns 422 with structured JSON error naming the field and valid range
  • Valid input returns 200 with prediction, probability, and model_version
  • /health endpoint reports model load status; returns unhealthy when model file is removed
  • MLflow shows experiment runs with parameters and metrics (no hardcoded values)
  • Same configuration run twice produces identical metric values
  • requirements.txt has pinned versions for all dependencies
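The health-check and versioning targets (T-04/T-05/T-06) can be sketched as below. The model path and the semver string are assumptions for illustration; the payload-building logic is kept separate from FastAPI so T-06's "model file removed" case is testable directly.

```python
import os

MODEL_PATH = "model.pkl"   # assumption: model file sits next to app.py
MODEL_VERSION = "1.0.0"    # assumption: a manually bumped version string

def load_model(path: str = MODEL_PATH):
    """Return the loaded model, or None when the file is missing (T-06)."""
    if not os.path.exists(path):
        return None
    import joblib  # deferred so the health logic runs without the model file
    return joblib.load(path)

def health_payload(model) -> tuple[int, dict]:
    """Status code and body for /health; 503 signals 'unhealthy' to monitors."""
    if model is None:
        return 503, {"status": "unhealthy", "model_loaded": False}
    return 200, {"status": "healthy", "model_loaded": True,
                 "model_version": MODEL_VERSION}
```

In the FastAPI app, a `/health` route would call `health_payload(model)` and wrap the result in a `JSONResponse`, and the same `MODEL_VERSION` constant would be echoed in every prediction response.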

Commit Convention

Commit after completing each ticket group (validation, health/versioning, experiment tracking, reproducibility). Use clear commit messages describing what was added and why.
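For the reproducibility group (T-11/T-13), the pattern is to seed every source of randomness from one config value and assert that two runs of the same config produce identical metrics. The sketch below uses a tiny synthetic dataset as a stand-in for the real churn data.

```python
import random

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

SEED = 42  # single seed threaded through every random operation (T-11)

def run_experiment(seed: int = SEED) -> float:
    # Seed Python and NumPy globals, plus every estimator that accepts one
    random.seed(seed)
    np.random.seed(seed)
    rng = np.random.RandomState(seed)
    X = rng.rand(200, 4)                          # synthetic stand-in data
    y = (X[:, 0] + X[:, 1] > 1.0).astype(int)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed)
    model = LogisticRegression(random_state=seed).fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))
```

The T-13 check is then a one-liner: `run_experiment(42) == run_experiment(42)` must hold exactly, not approximately. Pinned versions in requirements.txt (T-12) close the remaining gap, since metric values can drift across library releases even with fixed seeds.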