Learn by Directing AI
Unit 7

Build monitoring and alerting

Step 1: Think about who alerts serve

An alert that says "Dagster job failed" tells you something went wrong with the machine. An alert that says "the cost attribution report has not refreshed in 6 hours and the CFO reads it at 8am" tells you something went wrong for a person.

Infrastructure alerts (CPU usage, job status, memory) tell the engineer about the system. Business-outcome alerts tell the engineer about the people who depend on the data. AI defaults to infrastructure alerts because they require no business context.

The monitoring strategy starts with a question: who depends on this data, and what happens to them when it's wrong or late?

Step 2: Design the monitoring strategy

Map out what needs monitoring:

| Condition | Who's affected | What they need | Alert type |
|---|---|---|---|
| Cost attribution report > 6 hours stale | CFO, board | Current numbers for decisions | Business-outcome |
| Daily BigQuery spend exceeds expected range | Finance team | Cost predictability | Cost alert |
| Row count drops > 25% from yesterday | Data team | Data completeness | Trend-based |
| Dagster job fails | Data engineer | Pipeline health | Infrastructure |

Direct AI to implement these. The first three are the ones AI won't generate on its own -- they require understanding the business context that Fatimah described.

Step 3: Implement business-outcome alerts

Configure the freshness alert for the cost attribution report. This is the alert Fatimah cares about most: the CFO reads the report at 8am. If the data is more than 6 hours stale at that point, the report is useless.

In Dagster, this maps to a freshness policy on the fct_cost_attribution asset. If the asset hasn't been materialized in the last 6 hours, trigger an alert.
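In Dagster you'd express this with a freshness policy or asset check; the underlying decision is simple enough to sketch in plain Python. This is an illustrative helper, not Dagster's API -- the function name is made up, and the 6-hour default comes from Fatimah's requirement, not any framework default:

```python
from datetime import datetime, timedelta

# Hypothetical staleness check -- the 6-hour window is driven by the
# CFO's 8am read time, not by any Dagster default.
def is_stale(last_materialized: datetime, now: datetime,
             max_lag: timedelta = timedelta(hours=6)) -> bool:
    """True when the asset has not refreshed within max_lag."""
    return now - last_materialized > max_lag

# The report refreshed at midnight: stale by 7am, still fresh at 5am.
midnight = datetime(2025, 1, 1, 0, 0)
assert is_stale(midnight, datetime(2025, 1, 1, 7, 0))
assert not is_stale(midnight, datetime(2025, 1, 1, 5, 0))
```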

Then implement cost alerting: when daily BigQuery spend exceeds an expected range (you'll set this based on the cost analysis from Unit 4), alert with model-level attribution. Not "BigQuery is expensive" but "the fct_daily_deliveries full refresh cost 3x the normal incremental run."
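The model-level attribution is the part worth sketching. Assuming you already have per-model daily costs and baselines (in practice these would come from an INFORMATION_SCHEMA query grouped by model), the alert logic might look like this -- function and parameter names are illustrative:

```python
# Hypothetical per-model cost check: daily_costs and baselines would be
# built from BigQuery INFORMATION_SCHEMA job data attributed to dbt models.
def cost_alerts(daily_costs: dict[str, float],
                baselines: dict[str, float],
                multiple: float = 3.0) -> list[str]:
    """Return one message per model whose spend exceeds `multiple` x its baseline."""
    alerts = []
    for model, cost in daily_costs.items():
        baseline = baselines.get(model)
        if baseline and cost > multiple * baseline:
            alerts.append(
                f"{model} cost ${cost:.2f} today, "
                f"{cost / baseline:.1f}x its ${baseline:.2f} daily baseline"
            )
    return alerts

# A full refresh that cost well over 3x the incremental baseline is named;
# a model within its normal range produces nothing.
msgs = cost_alerts({"fct_daily_deliveries": 10.50, "dim_factories": 0.20},
                   {"fct_daily_deliveries": 3.20, "dim_factories": 0.25})
assert len(msgs) == 1 and "fct_daily_deliveries" in msgs[0]
```

The point of the message format: the recipient sees which model caused the spike, not just that spend went up.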

Step 4: Implement trend-based monitoring

Row count anomaly detection catches a class of failure that threshold tests miss. If Factory 1 normally delivers 150-200 records per day and today's count is 80, that's an anomaly -- even though 80 is above any reasonable minimum threshold.

Configure trend-based monitoring (Soda Core or custom SQL) that compares today's metrics against a rolling window. The detection should flag deviations that exceed the historical range, not just values below a fixed minimum.
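As a sketch of the rolling-window comparison (whether you implement it in Soda Core or custom SQL), the check compares today's count against the trailing window rather than a fixed floor. The 25% threshold here mirrors the table above; the function name is illustrative:

```python
# Illustrative rolling-window anomaly check: `history` holds the last N
# daily row counts for one source; the threshold is a deviation from the
# rolling mean, not a fixed minimum.
def row_count_anomaly(history: list[int], today: int,
                      threshold: float = 0.25) -> bool:
    """Flag today's count when it deviates more than `threshold` from the mean."""
    mean = sum(history) / len(history)
    return abs(today - mean) / mean > threshold

# Factory 1 normally lands 150-200 rows/day: 80 is flagged even though
# it would pass any reasonable fixed minimum; 170 is normal variation.
history = [160, 175, 150, 190, 185, 170, 200]
assert row_count_anomaly(history, 80)
assert not row_count_anomaly(history, 170)
```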

AI commonly sets the sensitivity threshold arbitrarily -- either too tight (every minor fluctuation becomes noise) or too loose (genuine anomalies slip through). Document your rationale for each threshold choice.

Step 5: Set thresholds deliberately

Each threshold is a trade-off between sensitivity and noise. Too tight: every normal variation triggers an alert. Too loose: genuine anomalies pass unnoticed.

For the cost attribution freshness: 6 hours is driven by Fatimah's requirement (CFO reads at 8am, data loads overnight). For the row count anomaly: base it on the historical variance you observed during profiling. For the cost alert: base it on the average daily cost from your INFORMATION_SCHEMA analysis.

Document the rationale for each threshold. "Set to 25%" is not documentation. "Set to 25% because historical daily variation is 10-15%, so 25% represents a 2x deviation from normal" is.
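One way to keep the rationale from drifting away from the number is to store both together in a small config. This structure is illustrative -- the names and values echo the thresholds discussed above:

```python
# Illustrative threshold config: each entry pairs the number with the
# reasoning behind it, so a reviewer never sees a bare "0.25".
THRESHOLDS = {
    "cost_attribution_max_staleness_hours": {
        "value": 6,
        "rationale": "CFO reads the report at 8am and data loads overnight; "
                     "anything older than 6 hours at read time is useless.",
    },
    "row_count_deviation": {
        "value": 0.25,
        "rationale": "Historical daily variation is 10-15%, so 25% "
                     "represents roughly a 2x deviation from normal.",
    },
    "daily_cost_multiple": {
        "value": 3.0,
        "rationale": "Based on average daily cost from the "
                     "INFORMATION_SCHEMA analysis in Unit 4.",
    },
}
```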

Step 6: Design backfill behavior

When you need to reload historical data (a backfill), the monitoring system faces a problem. A legitimate backfill produces large volume changes, unusual run patterns, and cost spikes that look identical to anomalies.

If the monitoring system can't distinguish a planned backfill from a data explosion, it will either flood you with false alarms during the backfill or be configured so loosely that real anomalies pass unnoticed.

Design a backfill annotation mechanism. Before a backfill starts, annotate the monitoring system so it knows to expect unusual volumes. The mechanism should suppress false positives during the planned operation without disabling anomaly detection entirely -- a real problem occurring during a backfill should still be caught.
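One possible shape for the annotation mechanism, sketched in plain Python (the store and function names are hypothetical): register a suppression window per asset before the backfill starts, and have only volume- and cost-type alerts consult it. Freshness and other alert types keep firing, so a real failure during the backfill is still caught:

```python
from datetime import datetime

# Hypothetical annotation store: (asset, window start, window end).
BACKFILL_WINDOWS: list[tuple[str, datetime, datetime]] = []

def annotate_backfill(asset: str, start: datetime, end: datetime) -> None:
    """Register a planned backfill window before the backfill begins."""
    BACKFILL_WINDOWS.append((asset, start, end))

def should_suppress(asset: str, at: datetime, alert_type: str) -> bool:
    """Suppress only volume/cost alerts inside a planned window -- other
    alert types (freshness, schema) still fire during a backfill."""
    if alert_type not in ("row_count", "cost"):
        return False
    return any(a == asset and start <= at <= end
               for a, start, end in BACKFILL_WINDOWS)

annotate_backfill("fct_daily_deliveries",
                  datetime(2025, 1, 1), datetime(2025, 1, 2))
# Volume spike during the window: suppressed. Freshness problem during
# the same window, or any alert on another asset: still fires.
assert should_suppress("fct_daily_deliveries", datetime(2025, 1, 1, 12), "row_count")
assert not should_suppress("fct_daily_deliveries", datetime(2025, 1, 1, 12), "freshness")
assert not should_suppress("dim_factories", datetime(2025, 1, 1, 12), "cost")
```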

Step 7: Apply the alert fatigue test

Review every alert you configured. For each one, answer:

  1. What condition triggers it?
  2. Who is affected?
  3. What action does the recipient take?

If you can't answer all three -- especially the third -- the alert is noise, not signal. An alert that fires but prompts no specific action is worse than no alert. It trains the recipient to ignore alerts, which means the one that matters gets ignored along with the rest.

JT Thompson has opinions about alert design. Check in with him -- he'll push back on anything that doesn't have a clear action path.

✓ Check

Check: List every alert configured. For each, state: (1) what condition triggers it, (2) who is affected, (3) what action the recipient takes. At least one alert is a business-outcome alert (not just "job failed").