Fairness Audit Guide
What is disaggregated evaluation?
Aggregate metrics tell you how a model performs overall. Disaggregated evaluation breaks those metrics down by subgroups -- demographic groups, regions, categories -- to see whether the model performs equally well for everyone. A model with 85% accuracy overall might have 90% accuracy for one group and 60% for another. The aggregate hides the disparity.
How to disaggregate predictions
The basic pattern: compute your evaluation metrics separately for each value of a demographic column.
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# Compute each metric separately for every value of the demographic column.
for group in df["demographic_column"].unique():
    group_mask = df["demographic_column"] == group
    group_preds = predictions[group_mask]
    group_labels = true_labels[group_mask]
    accuracy = accuracy_score(group_labels, group_preds)
    f1 = f1_score(group_labels, group_preds)
    precision = precision_score(group_labels, group_preds)
    recall = recall_score(group_labels, group_preds)
    print(f"{group}: Accuracy={accuracy:.3f}, F1={f1:.3f}, "
          f"Precision={precision:.3f}, Recall={recall:.3f}")
Compare each group's metrics against the overall metric computed on the full dataset. Disparities of ten percentage points or more in key metrics warrant investigation.
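The comparison step can be sketched as follows. This assumes you collected the per-group accuracies into a dict rather than only printing them; the names and the sample numbers here are illustrative, and the 0.10 threshold is the rule of thumb above:

```python
def flag_disparities(group_metrics, overall, threshold=0.10):
    """Return groups whose metric trails the overall value by >= threshold."""
    return {
        group: round(overall - value, 3)
        for group, value in group_metrics.items()
        if overall - value >= threshold
    }

# Hypothetical per-group accuracies collected from the loop above:
group_accuracies = {"Region A": 0.90, "Region B": 0.60, "Region C": 0.84}
flagged = flag_disparities(group_accuracies, overall=0.85)
print(flagged)  # {'Region B': 0.25} -- trails overall accuracy by 25 points
```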
Common fairness metrics
Demographic parity: Each group should receive positive outcomes at roughly the same rate. If 75% of nurses from Region A get placed but only 55% from Region B, that is a demographic parity gap. Measures: ratio of positive outcome rates between groups.
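A minimal sketch of measuring demographic parity, assuming binary 0/1 predictions and a parallel array of group labels (the arrays below recreate the 75% vs. 55% placement example):

```python
import numpy as np

def positive_rates(y_pred, groups):
    """Positive-outcome rate for each group."""
    y_pred = np.asarray(y_pred)
    groups = np.asarray(groups)
    return {g: float(y_pred[groups == g].mean()) for g in np.unique(groups)}

# Region A placed at 75%, Region B at 55%, as in the example above:
y_pred = np.array([1] * 75 + [0] * 25 + [1] * 55 + [0] * 45)
groups = np.array(["A"] * 100 + ["B"] * 100)
print(positive_rates(y_pred, groups))  # {'A': 0.75, 'B': 0.55}
```

The 20-point gap between the two rates is the demographic parity gap.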
Equalized odds: The model's true positive rate and false positive rate should be similar across groups. A model that correctly identifies good matches 90% of the time for Region A but only 70% for Region B has an equalized odds gap. Measures: difference in TPR and FPR between groups.
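The TPR/FPR comparison can be sketched like this, assuming binary labels and predictions; the helper names are illustrative:

```python
import numpy as np

def tpr_fpr(y_true, y_pred):
    """True positive rate and false positive rate for binary labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tpr = y_pred[y_true == 1].mean()  # share of actual positives predicted 1
    fpr = y_pred[y_true == 0].mean()  # share of actual negatives predicted 1
    return float(tpr), float(fpr)

def equalized_odds_gaps(y_true, y_pred, groups):
    """Largest absolute difference in TPR and in FPR across groups."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    groups = np.asarray(groups)
    per_group = {g: tpr_fpr(y_true[groups == g], y_pred[groups == g])
                 for g in np.unique(groups)}
    tprs = [v[0] for v in per_group.values()]
    fprs = [v[1] for v in per_group.values()]
    return max(tprs) - min(tprs), max(fprs) - min(fprs)
```

A model satisfies equalized odds when both gaps are close to zero.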
Disparate impact ratio: The ratio of positive outcome rates between the least-favored group and the most-favored group. A ratio below 0.8 (the "80% rule") is a common threshold for flagging potential discrimination. Compute: min(group_rates) / max(group_rates).
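The computation itself is a one-liner over the per-group positive rates; the example rates are the ones from the demographic parity discussion above:

```python
def disparate_impact_ratio(rates):
    """min(group_rates) / max(group_rates); below 0.8 fails the 80% rule."""
    return min(rates.values()) / max(rates.values())

ratio = disparate_impact_ratio({"A": 0.75, "B": 0.55})
print(round(ratio, 3), ratio < 0.8)  # 0.733 True -- fails the 80% rule
```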
These metrics measure different things. A model can satisfy one fairness criterion while violating another. The choice of which metric matters most depends on the business context and what kind of unfairness is most harmful.
Intervention options
Rebalancing training data: Oversample underrepresented groups or undersample overrepresented ones so the model sees equal representation during training. This addresses the root cause (biased training data) but may reduce overall accuracy if the original distribution was informative.
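A minimal sketch of the oversampling variant, assuming the data lives in a pandas DataFrame with a group column. In practice you would resample only the training split, before fitting; the column names and counts here are illustrative:

```python
import pandas as pd

def oversample_to_parity(df, group_col, random_state=0):
    """Oversample each group (with replacement) up to the largest group's size."""
    target = df[group_col].value_counts().max()
    parts = [
        g_df.sample(n=target, replace=True, random_state=random_state)
        for _, g_df in df.groupby(group_col)
    ]
    return pd.concat(parts, ignore_index=True)

df = pd.DataFrame({"region": ["A"] * 80 + ["B"] * 20,
                   "placed": [1] * 80 + [0] * 20})
balanced = oversample_to_parity(df, "region")
print(dict(sorted(balanced["region"].value_counts().items())))  # {'A': 80, 'B': 80}
```

Undersampling is the mirror image: sample each group down to the smallest group's size, trading data volume for balance.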
Threshold adjustment: Set different classification thresholds per group so that each group achieves roughly the same positive outcome rate. This preserves the model's learned patterns but adjusts the decision boundary. The trade-off: it may increase false positives for some groups.
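One simple way to pick per-group thresholds is via score quantiles: to give every group a positive rate of roughly `target_rate`, threshold each group at the `(1 - target_rate)` quantile of its own scores. This is a sketch under that assumption; production systems should also check whether group-specific treatment is legally permissible in their jurisdiction:

```python
import numpy as np

def per_group_thresholds(scores, groups, target_rate):
    """Threshold per group so each group's positive rate is ~target_rate."""
    scores = np.asarray(scores)
    groups = np.asarray(groups)
    return {g: float(np.quantile(scores[groups == g], 1 - target_rate))
            for g in np.unique(groups)}

def apply_thresholds(scores, groups, thresholds):
    """Classify each score against its own group's threshold."""
    return np.array([int(s >= thresholds[g])
                     for s, g in zip(scores, groups)])
```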
Constraint-based training: Add fairness constraints directly to the training objective. The model optimizes for both accuracy and fairness simultaneously. Libraries like Fairlearn provide tools for this. The trade-off: overall accuracy may decrease slightly because the model can no longer exploit biased patterns for easy gains.
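Fairlearn's reduction algorithms handle this rigorously; as a self-contained illustration of the idea only, here is a toy logistic regression trained by gradient descent whose loss adds a demographic-parity penalty (the squared gap between the two groups' mean predicted scores). Every name here is illustrative, and this is not the Fairlearn API:

```python
import numpy as np

def fair_logreg(X, y, group_mask, lam=1.0, lr=0.1, steps=2000):
    """Toy logistic regression: loss = log-loss + lam * (score gap)^2.

    group_mask is a boolean array marking one group; the penalty pushes
    the mean predicted score of the two groups together. Illustrative
    only -- use a library such as Fairlearn in practice.
    """
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(steps):
        p = 1 / (1 + np.exp(-X @ w))                    # predicted probabilities
        grad = X.T @ (p - y) / n                        # log-loss gradient
        gap = p[group_mask].mean() - p[~group_mask].mean()
        s = p * (1 - p)                                 # sigmoid derivative
        d_gap = (X[group_mask] * s[group_mask, None]).mean(axis=0) \
              - (X[~group_mask] * s[~group_mask, None]).mean(axis=0)
        w -= lr * (grad + 2 * lam * gap * d_gap)        # penalized step
    return w
```

Raising `lam` shrinks the gap between the groups' mean scores at some cost in log-loss, which is exactly the accuracy/fairness trade-off described above.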
Post-processing calibration: Adjust predictions after the model has been trained. Similar to threshold adjustment but can be more sophisticated (calibrating probabilities rather than just thresholds). This is the least invasive but also the most superficial -- it doesn't change what the model learned.
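A sketch of per-group probability calibration using scikit-learn's isotonic regression, fitting one calibrator per group so that each group's scores map to its own empirical outcome rates. The function names are illustrative, and the calibrators should be fit on held-out data, not the training set:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_group_calibrators(scores, y_true, groups):
    """Fit one isotonic score-to-probability calibrator per group."""
    scores = np.asarray(scores)
    y_true = np.asarray(y_true)
    groups = np.asarray(groups)
    return {
        g: IsotonicRegression(out_of_bounds="clip").fit(
            scores[groups == g], y_true[groups == g])
        for g in np.unique(groups)
    }

def calibrate(scores, groups, calibrators):
    """Map each raw score through its own group's calibrator."""
    return np.array([float(calibrators[g].predict([s])[0])
                     for s, g in zip(scores, groups)])
```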
Communicating findings to stakeholders
When you find a fairness issue, the stakeholder needs to understand three things:
- What the disparity is -- in plain terms, not statistical jargon. "Nurses from Region X get matched at a 55% rate while nurses from other regions get matched at 75%."
- Why it exists -- the training data reflects historical patterns, and the model learned those patterns. It is not making a judgment; it is reproducing what happened before.
- What the trade-offs are -- fixing the disparity may slightly reduce overall performance. Be honest about this. "If we make placement rates more equal, the average match quality score may drop from 0.82 to 0.79. But the model will no longer systematically disadvantage nurses from one region."
The stakeholder decides whether the trade-off is acceptable. Your job is to make the decision informed.