Learn by Directing AI
Unit 1

The Reliability Problem

Step 1: Read Emeka's Email

Open materials/emeka-reliability.md. Emeka's churn API went down last Tuesday for two hours. Nobody knew until Adaeze on his team tried to pull the weekly retention list and got an error. She ended up calling people from last week's list because she had no other option.

Emeka wants three things: to know when the API is unhealthy, to let his data team experiment without breaking production, and to make the system reproducible. Right now, if you left, nobody could recreate what was built.

Step 2: Talk to Emeka

Open the chat with Emeka. His email gives the broad problem, but there are details worth asking about. What exactly happened during the outage? What has his data team tried so far? Are there other reliability problems beyond the big outage?

When you ask about the outage, he'll tell you about Adaeze -- she was doing the weekly retention calls, opened the dashboard, and the API just errored out. She waited an hour, tried again, same thing. By the time anyone told Emeka, the API had been down for two hours and nobody had noticed.

When you ask about his data team, he'll explain they want to try different model settings -- tweak the threshold, try different features -- but they're worried that if they change something, the predictions the retention team relies on will break. There's one model and one API, and if it breaks, the team is blind.

Step 3: Set Up the Project

Open your terminal and start Claude Code:

cd ~/dev
claude

Paste this prompt:

Create the folder ~/dev/ml/p3. Download the project materials from https://learnbydirectingai.dev/materials/ml/p3/materials.zip and extract them into that folder. Read CLAUDE.md -- it's the project governance file.

Claude will download the materials and set up the workspace. After it finishes, look at what's in materials/. You should see CLAUDE.md, emeka-reliability.md, tickets.md, and the api-baseline/ directory with the existing API code.

Step 4: Review the Ticket Breakdown

Open materials/tickets.md and scan the full ticket list. The work breaks into four groups: input validation and error handling, health monitoring and versioning, experiment tracking infrastructure, and reproducibility. Thirteen tickets total.

The tickets give structure to the infrastructure work. Each group addresses one of Emeka's concerns. The first group (validation and errors) is about the API's contract with callers. The second group (health checks and versioning) is about knowing when the system is working and which model produced each prediction. The third group (experiment tracking) is about letting the data team experiment safely. The fourth group (reproducibility) is about making sure someone else can reproduce the system.
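To make the first group concrete, here is a small sketch of the kind of input validation those tickets call for: reject a bad payload with clear, structured error messages instead of whatever the framework emits. The field names (customer_id, tenure_months, monthly_spend) are hypothetical stand-ins, not taken from the actual ticket list.

```python
# Hypothetical sketch of structured input validation for a prediction
# payload. The field names are assumptions for illustration only.
REQUIRED_FIELDS = {"customer_id", "tenure_months", "monthly_spend"}

def validate_row(row: dict) -> list[str]:
    """Return a list of error strings; an empty list means the row is valid."""
    errors = []
    # Report every missing field by name, in a stable order.
    for field in sorted(REQUIRED_FIELDS - set(row)):
        errors.append(f"missing required field: {field}")
    # Type-check fields that are present.
    if "tenure_months" in row and not isinstance(row["tenure_months"], (int, float)):
        errors.append("tenure_months must be numeric")
    return errors
```

The point is the shape of the contract: a caller who sends a broken payload gets back a list of specific, human-readable problems rather than a bare 500.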

Step 5: Review the Existing API Code

Direct Claude to load and display materials/api-baseline/app.py. This is the serving endpoint from P2. Look at what's there and what's missing.

The API has a single /predict endpoint that loads a model, accepts a JSON body, and returns predictions. That's it. No input validation beyond whatever FastAPI does by default. No health check endpoint. No model versioning in responses. No structured error handling -- if something goes wrong, the caller gets whatever error FastAPI generates.

This is the starting point. The model works. The predictions are correct. But the system around it has no infrastructure. When the model file went missing last Tuesday, there was nothing to detect the failure, nothing to report it, and nothing to tell Emeka's team what went wrong.
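The failure-detection gap can be sketched in a few lines: a health check that verifies the model file exists before reporting healthy. The model path and the response shape are assumptions for illustration; this is the payload a hypothetical /health endpoint could return as JSON.

```python
# Hypothetical sketch of the detection Emeka lacked: confirm the
# model file is present before reporting the service healthy.
from pathlib import Path

MODEL_PATH = Path("model.pkl")  # assumed location; the real path may differ

def health_status(model_path: Path = MODEL_PATH) -> dict:
    """What a /health endpoint could return as its JSON body."""
    present = model_path.exists()
    return {
        "status": "ok" if present else "unhealthy",
        "model_file_present": present,
    }
```

Had something like this existed last Tuesday, a monitor polling it would have flagged the outage in minutes instead of leaving Adaeze to discover it by accident.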

✓ Check

The student can name at least three specific reliability gaps in the current API (e.g., no input validation, no health check endpoint, no model versioning).