Learn by Directing AI

From: Emeka Okafor <emeka.okafor@tundemobile.ng>
Subject: Reliability concerns with the churn API

Hi again,

Quick question -- well, not so quick actually.

My team ran the retention campaign last Tuesday and the API went down for about two hours in the morning. Nobody knew until Adaeze on my team tried to pull the weekly list and got an error. By the time it came back up, she'd already done the calls manually from last week's list.

Two things worry me. One: we didn't know the API was down until someone tried to use it. There's no monitoring, no health check, nothing that tells us "hey, the prediction service is offline." Two: my data team wants to run the model with different settings to see if they can improve it, but they're worried about breaking the one that's working. They said something about "we need to track our experiments properly."

Can we make this more reliable? I need to know the API is healthy, I need the team to be able to experiment without breaking production, and I need someone else on my team to be able to reproduce what we've built. Right now if you disappeared, nobody could recreate this.

Emeka
