P3 Tickets: Infrastructure Foundation

Group 1: Input Validation & Error Handling

T-01: Add Pydantic input validation with training data ranges

Add a Pydantic model that validates prediction requests against the training data's actual ranges and types. The validation should encode what the model was trained on -- not arbitrary limits.

Acceptance criteria:

  • Pydantic model validates all input features
  • Numeric fields have min/max constraints matching the training data profile (data_profile.json)
  • Categorical fields accept only values present in the training data
  • Invalid requests return 422 with a structured error response
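A minimal sketch of such a model, assuming illustrative field names (`tenure_months` and its 1–72 range come from the tickets; `monthly_charges` and the `prepaid`/`postpaid` categories are assumptions standing in for the real features from data_profile.json):

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError


class PredictionRequest(BaseModel):
    """Validates inputs against the ranges observed in the training data."""

    # Range taken from the ticket's example constraint (1-72 months).
    tenure_months: int = Field(ge=1, le=72)
    # Hypothetical numeric feature with an assumed training-data range.
    monthly_charges: float = Field(ge=0.0, le=500.0)
    # Categorical field restricted to values present in the training data.
    plan_type: Literal["prepaid", "postpaid"]


# A valid request passes validation.
ok = PredictionRequest(tenure_months=12, monthly_charges=49.9, plan_type="prepaid")

# An out-of-range request raises ValidationError, which FastAPI turns into a 422.
try:
    PredictionRequest(tenure_months=-5, monthly_charges=49.9, plan_type="prepaid")
except ValidationError as exc:
    errors = exc.errors()  # structured list: field location, constraint, message
```

Declaring the request as a FastAPI endpoint parameter of this type gets the 422 behavior for free; the constraints themselves should be generated from or checked against data_profile.json rather than typed by hand.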

T-02: Add structured error responses

Replace default FastAPI error handling with structured JSON error responses. Error responses should tell callers what went wrong and how to fix it -- not expose stack traces or internal paths.

Acceptance criteria:

  • All error responses return structured JSON with field name, constraint, and received value
  • No stack traces or internal file paths in any error response
  • Error messages use plain language (e.g., "field 'tenure_months' must be between 1 and 72")
  • HTTP status codes are appropriate (422 for validation, 500 for server errors)
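One way to meet these criteria is to reshape Pydantic's error list into the structured body before it leaves the service; the sketch below is framework-agnostic (in FastAPI this logic would live in a `RequestValidationError` exception handler), and the single-field model is a placeholder:

```python
from pydantic import BaseModel, Field, ValidationError


class PredictionRequest(BaseModel):
    tenure_months: int = Field(ge=1, le=72)


def to_error_response(exc: ValidationError) -> dict:
    """Reshape Pydantic errors into plain-language structured JSON.

    Exposes only field, constraint, message, and received value --
    never stack traces or internal file paths.
    """
    details = []
    for err in exc.errors():
        details.append({
            "field": ".".join(str(part) for part in err["loc"]),
            "constraint": err["type"],
            "message": err["msg"],
            "received": err.get("input"),  # present in Pydantic v2; None in v1
        })
    return {"error": "validation_failed", "details": details}


try:
    PredictionRequest(tenure_months=999)
except ValidationError as exc:
    body = to_error_response(exc)  # served to the caller with HTTP 422
```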

T-03: Test validation with valid, boundary, and invalid inputs

Exercise the validation layer with three classes of input: clearly valid requests, values at the boundaries, and clearly invalid requests.

Acceptance criteria:

  • Valid request returns 200 with prediction
  • Boundary request (e.g., tenure_months=1, tenure_months=72) returns 200
  • Out-of-range request (e.g., tenure_months=-5) returns 422 with structured error
  • Missing required field returns 422 with structured error
  • Wrong type (e.g., string where number expected) returns 422 with structured error
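These cases can be exercised directly against the validation model, independent of the HTTP layer (full end-to-end tests would go through the FastAPI test client); the single-field model here is a stand-in for the real one:

```python
from pydantic import BaseModel, Field, ValidationError


class PredictionRequest(BaseModel):
    tenure_months: int = Field(ge=1, le=72)


def is_valid(payload: dict) -> bool:
    """True when the payload passes validation, False when it would 422."""
    try:
        PredictionRequest(**payload)
        return True
    except ValidationError:
        return False


# Clearly valid.
assert is_valid({"tenure_months": 12})
# Boundary values are accepted.
assert is_valid({"tenure_months": 1})
assert is_valid({"tenure_months": 72})
# Out-of-range values are rejected.
assert not is_valid({"tenure_months": -5})
# Missing required field is rejected.
assert not is_valid({})
# Wrong type (non-numeric string) is rejected.
assert not is_valid({"tenure_months": "twelve"})
```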

Group 2: Health Monitoring & Versioning

T-04: Add health check endpoint that verifies model is loaded

Add a /health endpoint that checks whether the model is actually loaded and functional -- not just whether the server process is running.

Acceptance criteria:

  • /health endpoint exists at GET /health
  • Response includes model_loaded status (true/false)
  • Response includes model_version when loaded
  • Health check verifies the model can produce a prediction on a reference input
  • Returns 200 when healthy, 503 when unhealthy
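The core check can be written as a framework-agnostic function that an endpoint then wraps; the sketch below assumes a scikit-learn-style `predict` interface, and `FakeModel` is a test stand-in, not the real artifact:

```python
from typing import Any, Optional, Tuple


def health_check(model: Optional[Any], model_version: str,
                 reference_input: list) -> Tuple[int, dict]:
    """Return (status_code, body) for GET /health.

    Healthy means the model is loaded AND can produce a prediction on a
    known-good reference input -- not just that the process is running.
    """
    if model is None:
        return 503, {"model_loaded": False}
    try:
        model.predict([reference_input])
    except Exception:
        return 503, {"model_loaded": False}
    return 200, {"model_loaded": True, "model_version": model_version}


class FakeModel:
    """Stand-in with the same predict() surface as the real model."""

    def predict(self, rows):
        return [0 for _ in rows]


status, body = health_check(FakeModel(), "1.2.0", [12, 49.9])
```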

T-05: Add model versioning to prediction responses

Include the model version in every prediction response so that any prediction seen in production can be traced back to the exact model that produced it.

Acceptance criteria:

  • Every prediction response includes a model_version field
  • The version comes from model metadata or a version file (not hardcoded)
  • The version is consistent with the /health endpoint's reported version
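One way to keep the version out of the code is to read it from a metadata file that ships with the model artifact; the filename and field name below are assumptions:

```python
import json
import tempfile
from pathlib import Path


def load_model_version(metadata_path: Path) -> str:
    """Read the version from model metadata rather than hardcoding it."""
    return json.loads(metadata_path.read_text())["model_version"]


# Because /predict and /health both call this loader, their reported
# versions stay consistent by construction.
with tempfile.TemporaryDirectory() as tmp:
    meta = Path(tmp) / "model_metadata.json"
    meta.write_text(json.dumps({"model_version": "2025.01.15-a3f9"}))
    version = load_model_version(meta)
    response = {"prediction": 1, "model_version": version}
```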

T-06: Test health check by simulating model absence

Verify the health check detects real failure states.

Acceptance criteria:

  • Remove or rename the model file
  • /health returns 503 with model_loaded: false
  • /predict returns an appropriate error (not a stack trace)
  • Restore the model file
  • /health returns 200 with model_loaded: true
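The remove-then-restore cycle can be simulated in a test without touching the real artifact, assuming a loader that returns None for a missing file (the loader name and pickled stub are hypothetical):

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Optional


def load_model(path: Path) -> Optional[Any]:
    """Return the model, or None when the file is absent or unreadable."""
    try:
        return pickle.loads(path.read_bytes())
    except (FileNotFoundError, pickle.UnpicklingError):
        return None


with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.pkl"

    # Model file absent: /health should report 503 with model_loaded: false.
    missing = load_model(model_path)

    # File "restored": /health should report 200 with model_loaded: true.
    model_path.write_bytes(pickle.dumps({"kind": "stub-model"}))
    restored = load_model(model_path)
```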

Group 3: Experiment Tracking Infrastructure

T-07: Set up MLflow experiment with structured logging

Configure MLflow to log all experiment parameters, metrics, and model artifacts systematically.

Acceptance criteria:

  • MLflow experiment created with a descriptive name
  • All hyperparameters logged as parameters (not hardcoded values)
  • All evaluation metrics logged (overall and per-segment)
  • Model artifact logged
  • Data version or identifier logged
  • MLflow runs use context managers (with mlflow.start_run())

T-08: Verify MLflow logs actual variable values

Check that every mlflow.log_param and mlflow.log_metric call logs the actual variable, not a hardcoded value.

Acceptance criteria:

  • Review every log_param call -- each must reference a variable, not a literal
  • Review every log_metric call -- each must reference a computed metric variable
  • No log_metric call logs training accuracy as test accuracy
  • All runs are properly closed (context manager or explicit end_run)

T-09: Run controlled experiment with two model variants

Train two model variants with different hyperparameters and log both to MLflow.

Acceptance criteria:

  • Two experiments run with different hyperparameters
  • Both logged to MLflow with distinct run IDs
  • Parameters differ between runs
  • Metrics differ between runs
  • Both have model artifacts logged
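The variant loop might look like the sketch below, with synthetic data standing in for the real dataset and MLflow calls omitted (each loop body would sit inside its own `with mlflow.start_run():`, producing the two distinct run IDs):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real training data.
X, y = make_classification(n_samples=400, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Two variants differing only in their hyperparameters.
variants = {
    "small": {"n_estimators": 20, "max_depth": 3},
    "large": {"n_estimators": 200, "max_depth": None},
}

results = {}
for name, params in variants.items():
    model = RandomForestClassifier(random_state=42, **params)
    model.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, model.predict(X_test))
```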

T-10: Compare experiments in MLflow UI

Open the MLflow UI and compare the two runs.

Acceptance criteria:

  • MLflow UI accessible at localhost:5000
  • Both runs visible in the experiment view
  • Parameter and metric columns visible for comparison
  • Per-segment metrics (prepaid_recall, postpaid_recall) visible

Group 4: Reproducibility

T-11: Add random seeds for all random operations

Add random seeds for numpy, scikit-learn, and Python's random module to ensure deterministic execution.

Acceptance criteria:

  • numpy random seed set
  • scikit-learn random_state set for all estimators and splitters
  • Python random seed set
  • All seeds use the same base value for traceability
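A sketch of a single seed-setting entry point; `SEED = 42` is an arbitrary illustrative base value:

```python
import random

import numpy as np

SEED = 42  # single base value, logged with the run for traceability


def set_seeds(seed: int = SEED) -> None:
    """Seed every source of randomness the pipeline touches."""
    random.seed(seed)
    np.random.seed(seed)
    # scikit-learn has no global seed: pass random_state=seed explicitly to
    # every estimator and splitter, e.g. train_test_split(..., random_state=seed).


set_seeds()
first = np.random.rand(3)
set_seeds()
second = np.random.rand(3)  # identical to `first`
```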

T-12: Pin all library versions in requirements.txt

Replace unpinned dependencies with exact version pins.

Acceptance criteria:

  • Every dependency in requirements.txt has an exact version (e.g., scikit-learn==1.6.1)
  • Versions are mutually compatible
  • No dependency uses >= or latest
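A possible shape for the pinned file; `scikit-learn==1.6.1` comes from the criteria above, while the other package names and versions are purely illustrative, and the reliable way to get compatible pins is to capture the working environment with `pip freeze > requirements.txt`:

```text
# requirements.txt -- exact pins only, captured from the working environment
scikit-learn==1.6.1
numpy==1.26.4
pandas==2.2.2
fastapi==0.110.0
pydantic==2.7.1
mlflow==2.12.1
uvicorn==0.29.0
```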

T-13: Verify reproducibility

Run the same configuration twice and confirm identical results.

Acceptance criteria:

  • Same config, same seeds, same data produces identical metric values across two runs
  • MLflow shows two runs with identical metrics
  • If results differ, identify which component is non-deterministic and fix it
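The check can be automated by wrapping one full train/eval cycle in a function and running it twice; synthetic data and hyperparameters below are illustrative stand-ins for the real pipeline:

```python
import random

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

SEED = 42


def train_once(seed: int = SEED) -> float:
    """One full train/eval cycle with every seed pinned."""
    random.seed(seed)
    np.random.seed(seed)
    X, y = make_classification(n_samples=300, n_features=6, random_state=seed)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))


run_a = train_once()
run_b = train_once()  # must be bit-for-bit identical to run_a
```

If the two values ever diverge, bisect the cycle (data load, split, fit, evaluate) to find the step that ignores its seed.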