P3 Tickets: Infrastructure Foundation
Group 1: Input Validation & Error Handling
T-01: Add Pydantic input validation with training data ranges
Add a Pydantic model that validates prediction requests against the training data's actual ranges and types. The validation should encode what the model was trained on -- not arbitrary limits.
Acceptance criteria:
- Pydantic model validates all input features
- Numeric fields have min/max constraints matching the training data profile (data_profile.json)
- Categorical fields accept only values present in the training data
- Invalid requests return 422 with a structured error response
T-02: Add structured error responses
Replace default FastAPI error handling with structured JSON error responses. Error responses should tell callers what went wrong and how to fix it -- not expose stack traces or internal paths.
Acceptance criteria:
- All error responses return structured JSON with field name, constraint, and received value
- No stack traces or internal file paths in any error response
- Error messages use plain language (e.g., "field 'tenure_months' must be between 1 and 72")
- HTTP status codes are appropriate (422 for validation, 500 for server errors)
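One way to build the structured body, sketched as a plain helper over Pydantic v2's `ValidationError.errors()` (each entry carries `loc`, `msg`, and the offending `input`). The helper name is hypothetical; in FastAPI it would be wired up through an exception handler registered for `RequestValidationError`.

```python
from pydantic import BaseModel, Field, ValidationError


def format_validation_errors(exc: ValidationError) -> dict:
    """Turn a pydantic ValidationError into a structured 422 body:
    field name, the violated constraint, and the received value --
    never a stack trace or an internal file path."""
    return {
        "errors": [
            {
                "field": ".".join(str(part) for part in err["loc"]),
                "constraint": err["msg"],      # e.g. "Input should be >= 1"
                "received": err.get("input"),  # the offending value
            }
            for err in exc.errors()
        ]
    }
```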
T-03: Test validation with valid, boundary, and invalid inputs
Test the validation layer with a range of inputs: clearly valid, boundary values, and clearly invalid.
Acceptance criteria:
- Valid request returns 200 with prediction
- Boundary requests (tenure_months=1 and tenure_months=72) return 200
- Out-of-range request (e.g., tenure_months=-5) returns 422 with structured error
- Missing required field returns 422 with structured error
- Wrong type (e.g., string where number expected) returns 422 with structured error
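The five cases above can be sketched as assertions against the validation model directly (the model here is a minimal stand-in; full acceptance tests would go through the API, e.g. with FastAPI's TestClient, and check the 200/422 status codes):

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError


class PredictionRequest(BaseModel):
    tenure_months: int = Field(ge=1, le=72)
    plan_type: Literal["prepaid", "postpaid"]


def rejects(**payload) -> bool:
    """True if validation rejects the payload (the API would return 422)."""
    try:
        PredictionRequest(**payload)
        return False
    except ValidationError:
        return True


# Clearly valid and boundary values are accepted.
assert not rejects(tenure_months=36, plan_type="prepaid")
assert not rejects(tenure_months=1, plan_type="postpaid")   # lower bound
assert not rejects(tenure_months=72, plan_type="postpaid")  # upper bound

# Out of range, missing field, and wrong type are all rejected.
assert rejects(tenure_months=-5, plan_type="prepaid")
assert rejects(plan_type="prepaid")                         # missing required field
assert rejects(tenure_months="many", plan_type="prepaid")   # wrong type
```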
Group 2: Health Monitoring & Versioning
T-04: Add health check endpoint that verifies model is loaded
Add a /health endpoint that checks whether the model is actually loaded and functional -- not just whether the server process is running.
Acceptance criteria:
- /health endpoint exists at GET /health
- Response includes model_loaded status (true/false)
- Response includes model_version when loaded
- Health check verifies the model can produce a prediction on a reference input
- Returns 200 when healthy, 503 when unhealthy
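A framework-free sketch of the decision logic, assuming a scikit-learn-style model with a `predict` method; the function name, signature, and response fields are assumptions. A FastAPI route would wrap this and return the body with the given status code.

```python
from typing import Any, Tuple


def health(model: Any, model_version: str, reference_input: list) -> Tuple[int, dict]:
    """Health check that exercises the model, not just the process.

    Returns (status_code, body): 200 only if the model is loaded AND can
    produce a prediction on a known-good reference input, else 503."""
    if model is None:
        return 503, {"model_loaded": False}
    try:
        model.predict([reference_input])
    except Exception:
        # Model object exists but cannot predict -- still unhealthy.
        return 503, {"model_loaded": False}
    return 200, {"model_loaded": True, "model_version": model_version}
```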
T-05: Add model versioning to prediction responses
Include the model version in every prediction response so that any prediction seen in production can be traced back to the exact model that produced it.
Acceptance criteria:
- Every prediction response includes a model_version field
- The version comes from model metadata or a version file (not hardcoded)
- The version is consistent with the /health endpoint's reported version
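A sketch of the version plumbing, assuming the version lives in a JSON metadata file shipped alongside the artifact (the file name and key are assumptions). Reading it once and passing the same value to both /predict and /health gives a single source of truth, so the two can never disagree.

```python
import json
from pathlib import Path


def load_model_version(metadata_path: str = "model_metadata.json") -> str:
    """Read the model version from the artifact's metadata file.

    Both /predict and /health report this value -- never a hardcoded string."""
    meta = json.loads(Path(metadata_path).read_text())
    return meta["version"]


def build_prediction_response(prediction: float, model_version: str) -> dict:
    # Every response carries the version that produced the prediction.
    return {"prediction": prediction, "model_version": model_version}
```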
T-06: Test health check by simulating model absence
Verify the health check detects real failure states.
Acceptance criteria:
- With the model file removed or renamed: /health returns 503 with model_loaded: false, and /predict returns a structured error (not a stack trace)
- With the model file restored: /health returns 200 with model_loaded: true
Group 3: Experiment Tracking Infrastructure
T-07: Set up MLflow experiment with structured logging
Configure MLflow to log all experiment parameters, metrics, and model artifacts systematically.
Acceptance criteria:
- MLflow experiment created with a descriptive name
- All hyperparameters logged as parameters (not hardcoded values)
- All evaluation metrics logged (overall and per-segment)
- Model artifact logged
- Data version or identifier logged
- MLflow runs use context managers (with mlflow.start_run())
T-08: Verify MLflow logs actual variable values
Check that every mlflow.log_param and mlflow.log_metric call logs the actual variable, not a hardcoded value.
Acceptance criteria:
- Review every log_param call -- each must reference a variable, not a literal
- Review every log_metric call -- each must reference a computed metric variable
- No log_metric call logs training accuracy as test accuracy
- All runs are properly closed (context manager or explicit end_run)
T-09: Run controlled experiment with two model variants
Train two model variants with different hyperparameters and log both to MLflow.
Acceptance criteria:
- Two experiments run with different hyperparameters
- Both logged to MLflow with distinct run IDs
- Parameters differ between runs
- Metrics differ between runs
- Both have model artifacts logged
T-10: Compare experiments in MLflow UI
Open the MLflow UI and compare the two runs.
Acceptance criteria:
- MLflow UI accessible at localhost:5000
- Both runs visible in the experiment view
- Parameter and metric columns visible for comparison
- Per-segment metrics (prepaid_recall, postpaid_recall) visible
Group 4: Reproducibility
T-11: Add random seeds for all random operations
Add random seeds for numpy, scikit-learn, and Python's random module to ensure deterministic execution.
Acceptance criteria:
- numpy random seed set
- scikit-learn random_state set for all estimators and splitters
- Python random seed set
- All seeds use the same base value for traceability
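A minimal sketch of the seeding helper. Note that scikit-learn has no global seed: the base value must also be passed as `random_state` to every estimator and splitter, which the comments flag (the example calls in the comments are illustrative).

```python
import random

import numpy as np

SEED = 42  # one base value, traceable through every component


def set_seeds(seed: int = SEED) -> None:
    """Seed every source of randomness the pipeline touches."""
    random.seed(seed)
    np.random.seed(seed)
    # scikit-learn has no global seed: pass random_state=seed explicitly
    # to every estimator and splitter, e.g.
    #   train_test_split(X, y, random_state=seed)
    #   RandomForestClassifier(random_state=seed)


# Same seed, same draws -- the basis of T-13's reproducibility check.
set_seeds()
a = np.random.rand(3)
set_seeds()
b = np.random.rand(3)
```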
T-12: Pin all library versions in requirements.txt
Replace unpinned dependencies with exact version pins.
Acceptance criteria:
- Every dependency in requirements.txt has an exact version (e.g., scikit-learn==1.6.1)
- Versions are mutually compatible
- No dependency uses >= or latest
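An illustrative requirements.txt shape; the package list and version numbers below are placeholders (except scikit-learn==1.6.1, used as the example above). Take the real values from `pip freeze` in the environment where the service actually works.

```text
# requirements.txt -- exact pins only; versions are illustrative,
# copy the real ones from `pip freeze` in the working environment
fastapi==0.110.0
pydantic==2.7.1
scikit-learn==1.6.1
numpy==1.26.4
mlflow==2.12.1
uvicorn==0.29.0
```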
T-13: Verify reproducibility
Run the same configuration twice and confirm identical results.
Acceptance criteria:
- Same config, same seeds, same data produces identical metric values across two runs
- MLflow shows two runs with identical metrics
- If results differ, identify which component is non-deterministic and fix it
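The check can be sketched as two full train/evaluate passes with every `random_state` pinned, asserting the metric matches exactly (not approximately). The synthetic data and hyperparameters here are stand-ins for the real config.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

SEED = 42


def train_and_score(seed: int = SEED) -> float:
    """One full train/evaluate pass with every random_state pinned."""
    rng = np.random.default_rng(seed)           # seeded synthetic data (stand-in)
    X = rng.normal(size=(300, 5))
    y = (X[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
    model = RandomForestClassifier(n_estimators=20, random_state=seed)
    model.fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))


# Same config, same seed -> bit-identical metric across runs.
run_a = train_and_score()
run_b = train_and_score()
```

If the two values ever differ, bisect by component (data generation, split, estimator) to find the unseeded source of randomness.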