Learn by Directing AI
Unit 7

Observability

Step 1: What Ivan needs to know

Ivan runs a museum. He doesn't care about CPU usage or memory consumption. He cares about:

  • How many people browse the collection online?
  • How many members log in?
  • Which artifacts are most popular?
  • Is the site fast for tourists on mobile?

The default instrumentation AI generates -- request duration histograms, HTTP status code counters, memory gauges -- answers a different question: "Is the server under stress?" That's useful for the developer, not for the director.

Custom metrics bridge the gap. You design metrics that answer Ivan's questions, not the server's.

Step 2: Design custom metrics

Before writing any instrumentation code, decide what to measure.

Counters only go up. Good for: total collection searches, total member logins, total content views. You query the rate of change over time -- "searches per minute" -- rather than the absolute number.

Histograms track distributions. Good for: page load times, search response times. You get percentiles (p50, p95, p99) -- "95% of collection pages load in under 800ms."

Gauges go up and down. Good for: active sessions, current concurrent visitors. Less common because most web metrics are cumulative.
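
The difference shows up at query time. As a sketch (PromQL, assuming metric names like the ones designed below), a counter is read through its rate of change and a histogram through its percentiles:

```promql
# Counter: never read the raw total; ask for its rate of change.
# "Searches per minute", averaged over 5-minute windows:
rate(collection_searches_total[5m]) * 60

# Histogram: ask for a percentile across the recorded buckets.
# "95% of collection pages load in under this many seconds":
histogram_quantile(0.95, sum by (le) (rate(collection_page_load_seconds_bucket[5m])))
```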

Design these metrics for the museum:

| Metric name | Type | Labels | What it answers |
| --- | --- | --- | --- |
| collection_searches_total | Counter | category | How many searches, broken down by category filter? |
| collection_page_load_seconds | Histogram | route_type (ssg/ssr/csr) | How fast are pages loading, by rendering strategy? |
| member_logins_total | Counter | none | How many members are logging in? |
| member_content_views_total | Counter | content_type (lectures/images/archives) | What content are members accessing? |

Notice what's not here: user_id as a label. Adding user_id to member_content_views_total would create one time series per member. With 400 members and 3 content types, that's 1,200 time series for a single metric. Scale that across all metrics and you overwhelm Prometheus storage and query performance. Keep labels to small, bounded sets.

Step 3: Instrument the application

Install the Prometheus client:

npm install prom-client

Direct Claude to create a metrics module that defines the custom metrics from Step 2. Then add instrumentation at the points where the application interacts with users:

  • Collection search handler: increment collection_searches_total with the category label
  • Page rendering: record collection_page_load_seconds with the route type
  • Auth middleware: increment member_logins_total on successful login
  • Content access handler: increment member_content_views_total with the content type

Expose a /metrics endpoint that Prometheus scrapes.

AI commonly generates comprehensive system-level instrumentation (every HTTP endpoint, every database query) while missing business-level metrics entirely. Review what Claude produces. If it added CPU and memory metrics but not collection_searches_total, the instrumentation answers the developer's questions, not Ivan's.

Step 4: Build Ivan's dashboard

Set up Grafana and connect it to Prometheus. Then build a dashboard for Ivan.

This dashboard answers his questions, not yours. No jargon. No raw Prometheus queries visible. Clear labels and titles:

  • "Collection Searches" -- a time series showing search volume over the past 24 hours
  • "Member Activity" -- a stat panel showing logins today
  • "Popular Categories" -- a bar chart showing artifact categories ranked by view count
  • "Site Speed" -- a gauge showing average page load time, with a green zone up to 2 seconds

A dashboard is a communication artifact. It tells a specific audience what they need to know. Twelve panels of raw Prometheus data is not a dashboard -- it's a monitoring wall that nobody reads. Ivan's dashboard has four panels because Ivan has four questions.

Step 5: Build the developer dashboard

Build a second dashboard for yourself. This one can be technical:

  • Request rate by HTTP status code (stacked area chart)
  • Response time distribution (p50, p95, p99)
  • Error rate (single stat with threshold)
  • Memory usage (gauge)
  • Active connections (time series)
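
As a sketch, the technical panels might be backed by queries like these -- assuming prom-client's default metrics are enabled and an HTTP request-duration histogram exists (the http_request_duration_seconds name is an assumption; it comes from middleware, not from prom-client by default):

```promql
# Request rate by status code (stacked area chart)
sum by (status_code) (rate(http_request_duration_seconds_count[5m]))

# p95 response time (repeat with 0.50 and 0.99 for the other percentiles)
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Error rate -- share of requests returning 5xx
sum(rate(http_request_duration_seconds_count{status_code=~"5.."}[5m]))
  / sum(rate(http_request_duration_seconds_count[5m]))

# Memory usage -- resident set size from prom-client's default metrics
process_resident_memory_bytes
```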

The contrast between the two dashboards is the point. Different audiences need different views of the same system. Ivan sees whether people are using the museum's digital collection. You see whether the system serving it is healthy.

Send Ivan the business dashboard. He responds after reviewing it: "This is exactly what I've been wanting. I can show the board how many people are using the collection online. One question -- can we see which artifacts are most popular? The curatorial team wants to know what people are interested in."

That's a reasonable request. You could add an artifact_id label to collection views -- but consider the cardinality. An artifact_id label with 150 values, multiplied by the existing content_type label's 3 values, is 450 time series for a single metric, and every further label multiplies it again. Think about whether a top-10 query against the existing data would answer Ivan's question without adding a high-cardinality label.
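
For instance, if category-level popularity is enough for the curatorial team, a topk query over the labels you already have avoids the new label entirely (PromQL sketch):

```promql
# Top 10 categories by search volume over the past 7 days --
# no artifact_id label, no new time series
topk(10, sum by (category) (increase(collection_searches_total[7d])))
```

If Ivan truly needs per-artifact detail, that granularity belongs in application analytics or the database, not in Prometheus labels.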

✓ Check

Open Ivan's Grafana dashboard. Browse the collection and search for "compass." Log in as a member and view a lecture recording. The dashboard should show the search count increment and the member content view increment -- live, not after a refresh.