Simpson's Paradox in Contact Center Metrics

From WFM Labs

Simpson's Paradox in Contact Center Metrics is the appearance, in aggregated workforce management data, of a trend that reverses or disappears when the data are broken into their underlying groups. A team, channel, or vendor can look better on a blended metric while being worse in every individual segment. The effect is named for Edward Simpson, who formalized it in 1951, though it had been noted earlier by Pearson, Yule, and others.[1] Because contact center reporting is built on aggregation — rolling intervals into days, queues into sites, segments into a single number — the paradox is a routine hazard rather than a curiosity.

How the reversal happens

Simpson's paradox arises when a hidden variable, called a confounder, is distributed unevenly across the groups being compared and also affects the outcome. The classic real-world example is the 1973 University of California, Berkeley admissions data, where the university appeared to admit men at a higher rate than women overall, yet within most individual departments women were admitted at equal or higher rates. The aggregate reversed because women applied disproportionately to more competitive departments with lower admission rates for everyone.[2] The aggregate was not wrong; it answered a different question than the per-department figures, and the two should not be conflated.[3]
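
The reversal is easiest to see with numbers. The sketch below uses invented first-contact-resolution counts for two hypothetical teams ("in_house" and "vendor") across two contact types. The counts and the names COUNTS, segment_rates, and blended_rate are assumptions made for illustration, not data from any real operation.

```python
# Hypothetical first-contact-resolution counts: (resolved, handled) per contact type.
# These numbers are invented purely to illustrate the reversal.
COUNTS = {
    "in_house": {"simple": (90, 100), "complex": (300, 500)},
    "vendor": {"simple": (425, 500), "complex": (50, 100)},
}


def segment_rates(team):
    """Resolution rate within each contact type for one team."""
    return {
        seg: resolved / handled
        for seg, (resolved, handled) in COUNTS[team].items()
    }


def blended_rate(team):
    """Resolution rate after pooling all contact types for one team."""
    resolved = sum(r for r, _ in COUNTS[team].values())
    handled = sum(h for _, h in COUNTS[team].values())
    return resolved / handled


for team in COUNTS:
    per_segment = ", ".join(
        f"{seg} {r:.0%}" for seg, r in segment_rates(team).items()
    )
    print(f"{team}: {per_segment}, blended {blended_rate(team):.1%}")
```

With these counts the in-house team resolves more in both segments (90% vs 85% on simple contacts, 60% vs 50% on complex ones), yet the vendor posts the higher blended rate (about 79% vs 65%) because most of its volume is simple work. Contact type is the confounder: it differs between the teams and it drives the outcome.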

Where it shows up in WFM

The same structure recurs throughout contact center measurement whenever a confounder such as contact mix, channel, tenure, or volume differs across the things being compared:

  • Blended service level. An overall service level can sit comfortably above target while a lower-volume, hard-to-staff segment misses badly, because easy, high-volume intervals dominate the blend. The aggregate hides the segment that customers actually experience as failure.
  • Vendor and site comparison. A vendor can post a better blended AHT or quality score while handling an easier contact mix; normalized by interaction type, the in-house team may outperform it in every category. Comparing blended numbers across sites with different mixes is the most common WFM instance of the paradox.
  • Agent and cohort performance. A tenured cohort handling complex work can show worse aggregate metrics than a new cohort handling simple work, inverting the true skill ranking.
  • Before-and-after analysis. If the contact mix shifts between two periods — more digital, more complex, post-automation residual work — a blended metric can move in a direction opposite to performance within each contact type. This interacts with regression to the mean to make naive before-versus-after comparisons especially treacherous.
  • AI deflection effects. When automation removes the simplest contacts, the human queue's blended AHT and difficulty rise even if agents handle each contact type exactly as well as before — an aggregate shift driven entirely by mix, not by performance.
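
The before-and-after and deflection cases above can be sketched as a small decomposition that splits a change in blended AHT into a part caused by mix and a part caused by per-type performance. The AHT values, the volumes, and the function names are hypothetical, and the sketch assumes the same contact types appear in both periods.

```python
def shares(volumes):
    """Convert contact volumes by type into shares of total volume."""
    total = sum(volumes.values())
    return {t: v / total for t, v in volumes.items()}


def blended_aht(volumes, aht):
    """Volume-weighted average handle time across contact types."""
    mix = shares(volumes)
    return sum(mix[t] * aht[t] for t in volumes)


def decompose_change(vol_before, aht_before, vol_after, aht_after):
    """Split the change in blended AHT into mix and performance effects.

    mix effect         = sum over types of (share_after - share_before) * aht_before
    performance effect = sum over types of share_after * (aht_after - aht_before)
    The two effects add up exactly to the change in blended AHT.
    """
    mix_b, mix_a = shares(vol_before), shares(vol_after)
    mix_effect = sum((mix_a[t] - mix_b[t]) * aht_before[t] for t in mix_b)
    performance_effect = sum(
        mix_a[t] * (aht_after[t] - aht_before[t]) for t in mix_b
    )
    return mix_effect, performance_effect


# Hypothetical figures: per-type AHT in seconds is identical in both periods;
# automation removes 500 of the 700 simple contacts from the human queue.
AHT = {"simple": 180.0, "complex": 600.0}
BEFORE = {"simple": 700, "complex": 300}
AFTER = {"simple": 200, "complex": 300}

before = blended_aht(BEFORE, AHT)
after = blended_aht(AFTER, AHT)
mix_effect, performance_effect = decompose_change(BEFORE, AHT, AFTER, AHT)
print(f"blended AHT: {before:.0f}s -> {after:.0f}s")
print(f"mix effect: {mix_effect:+.0f}s, performance effect: {performance_effect:+.0f}s")
```

In this invented scenario blended AHT rises from 306 to 432 seconds, and the whole 126-second increase is attributed to mix, with a performance effect of zero. Other decompositions are possible (for example, weighting the performance term by the earlier mix); this is one common convention, not a method prescribed by the source.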

Avoiding the trap

  • Segment before concluding. Always check whether an aggregate trend holds within the relevant groups — contact type, channel, tenure, complexity — before acting on it.
  • Compare like with like. Normalize cross-site and cross-vendor comparisons for contact mix; compare within segments, not across blends.
  • Identify the confounder. Ask what differs between the groups that also drives the metric. This is where Simpson's paradox connects to causal reasoning and causal inference: deciding which view (aggregated or segmented) answers the question requires knowing the causal structure, not just the numbers.
  • Report mix alongside the metric. A blended number is only interpretable next to the composition that produced it.
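
The first two checks above can be sketched in a few lines. The example flags a comparison whose blended ranking is contradicted by every segment, then compares the two teams on a common reference mix. The technique used for normalization is direct standardization, chosen here as one common option; the article does not prescribe a specific method. All counts and function names are hypothetical.

```python
# Hypothetical (resolved, handled) counts per contact type for two teams.
IN_HOUSE = {"simple": (90, 100), "complex": (300, 500)}
VENDOR = {"simple": (425, 500), "complex": (50, 100)}


def segment_rates(counts):
    """Resolution rate within each contact type."""
    return {seg: r / h for seg, (r, h) in counts.items()}


def blended_rate(counts):
    """Resolution rate after pooling all contact types."""
    return sum(r for r, _ in counts.values()) / sum(h for _, h in counts.values())


def is_reversal(a, b):
    """True if the blended gap between a and b points the opposite way to every segment gap."""
    blended_gap = blended_rate(a) - blended_rate(b)
    rates_a, rates_b = segment_rates(a), segment_rates(b)
    return all((rates_a[s] - rates_b[s]) * blended_gap < 0 for s in rates_a)


def reference_mix(*teams):
    """Share of pooled volume by contact type across all teams compared."""
    pooled = {}
    for counts in teams:
        for seg, (_, handled) in counts.items():
            pooled[seg] = pooled.get(seg, 0) + handled
    total = sum(pooled.values())
    return {seg: v / total for seg, v in pooled.items()}


def standardized_rate(counts, mix):
    """Rate the team would post if it handled the reference mix (direct standardization)."""
    rates = segment_rates(counts)
    return sum(mix[seg] * rates[seg] for seg in mix)


mix = reference_mix(IN_HOUSE, VENDOR)
print("reversal detected:", is_reversal(IN_HOUSE, VENDOR))
print("reference mix:", mix)
print(f"in-house standardized: {standardized_rate(IN_HOUSE, mix):.1%}")
print(f"vendor standardized:   {standardized_rate(VENDOR, mix):.1%}")
```

On the shared 50/50 mix the in-house team scores 75.0% and the vendor 67.5%, which matches the within-segment ranking. Printing the reference mix next to the standardized rates follows the last point in the list: the number is only interpretable alongside the composition behind it. Whether the standardized or the blended view answers the business question still depends on the causal structure, which code alone cannot settle.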

Maturity Model Position

In the WFM Labs Maturity Model™, resistance to aggregation artifacts marks the move from surface reporting to genuine analysis.

  • Level 1–2 (Emerging / Foundational) — decisions are made on blended top-line numbers; cross-site and cross-vendor comparisons ignore mix, and segment-level reversals go unnoticed.
  • Level 3 (Progressive) — reporting is routinely segmented, comparisons are normalized for contact mix, and aggregates are read alongside their composition.
  • Level 4–5 (Advanced / Pioneering) — the causal structure behind comparisons is modeled explicitly (see Causal Inference in Workforce Management), and automated reporting surfaces segment-level divergence rather than burying it in a blend.

References

  1. Simpson, E. H. (1951). "The Interpretation of Interaction in Contingency Tables". Journal of the Royal Statistical Society, Series B, 13(2), 238–241. doi:10.1111/j.2517-6161.1951.tb00088.x.
  2. Bickel, P. J., Hammel, E. A., & O'Connell, J. W. (1975). "Sex Bias in Graduate Admissions: Data from Berkeley". Science, 187(4175), 398–404. doi:10.1126/science.187.4175.398.
  3. Pearl, J. (2009). Causality: Models, Reasoning, and Inference. 2nd ed. Cambridge University Press. ISBN 978-0-521-89560-6.