Forecast Accuracy Metrics

From WFM Labs

Forecast Accuracy Metrics are the measurements practitioners use to evaluate forecast quality. The choice of metric matters as much as the choice of method — different metrics reward different behaviors, and the wrong metric for the wrong context produces forecasts optimized for the wrong outcome.

This page documents the metrics WFM practitioners encounter, their behavior under different data shapes, and which to use when. The headline rule: do not optimize MAPE on intermittent demand.

Why metrics matter

A forecast cannot be evaluated against itself. Accuracy is measured against actuals on held-out data — either historical periods reserved for testing or a rolling production comparison. The metric is the function that turns the difference between forecast and actual into a single comparable number.

Different metrics produce different rankings. A method that wins on MAPE may lose on RMSE; two methods that tie on RMSE may differ dramatically on MASE. The metric is a value choice: it encodes which kind of error matters more.

In WFM contexts, the choice has operational consequences. A forecast that minimizes MAPE may systematically under-predict skill-level intermittent volume because percentage errors blow up near zero. A forecast that minimizes RMSE may produce smoother but less responsive forecasts that miss legitimate spikes. The practitioner needs to know what each metric rewards.

The core metrics

Throughout, let $y_t$ be the actual at time $t$ and $\hat{y}_t$ the forecast for time $t$, with errors $e_t = y_t - \hat{y}_t$.

Mean Absolute Error (MAE)

$$\mathrm{MAE} = \frac{1}{n}\sum_{t=1}^{n} |e_t|$$

In the units of the data. Easy to interpret: "on average, the forecast is off by N calls per interval." Robust to scale; treats all errors equally regardless of size.

When to use: general-purpose accuracy on a single series. Reporting accuracy in business-friendly units.

Limitations: cannot compare across series with different scales (the MAE of a queue averaging 1,000 calls per interval is not comparable to that of a queue averaging 10).

Root Mean Squared Error (RMSE)

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n} e_t^2}$$

Also in the units of the data. Penalizes larger errors more than smaller ones (because of the squaring). A method with rare large errors will score worse on RMSE than on MAE.

When to use: when large errors are operationally costly and you want the metric to reflect that. Common in capacity planning where a forecast 50% over the actual is much more expensive than a forecast 5% over.

Limitations: same scale issue as MAE; no cross-series comparison.
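To make the MAE/RMSE contrast concrete, here is a minimal sketch in plain Python. All values are illustrative, not from this article:

```python
# Minimal sketch: MAE vs RMSE on a toy interval series (illustrative numbers).
import math

def mae(actuals, forecasts):
    # Mean absolute error, in the units of the data.
    return sum(abs(a - f) for a, f in zip(actuals, forecasts)) / len(actuals)

def rmse(actuals, forecasts):
    # Root mean squared error; squaring weights large errors more heavily.
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actuals, forecasts)) / len(actuals))

actuals   = [100, 120, 110, 300]  # one legitimate spike at the end
forecasts = [105, 115, 115, 120]  # misses the spike badly

print(mae(actuals, forecasts))   # 48.75: the spike is averaged in
print(rmse(actuals, forecasts))  # ~90.1: the spike dominates after squaring
```

Both metrics see the same four errors; only RMSE makes the single 180-call miss dominate the score.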

Mean Absolute Percentage Error (MAPE)

$$\mathrm{MAPE} = \frac{100}{n}\sum_{t=1}^{n} \left|\frac{e_t}{y_t}\right|$$

Scale-free; expressed as a percentage. The default metric most people reach for first because the units are intuitive ("our forecast is 8% off on average").

When to use: communicating accuracy to non-technical audiences; comparing methods across series with different scales (cautiously).

Critical limitation in WFM: MAPE is undefined when $y_t = 0$ and explodes when $y_t$ is small. Any series with intermittent demand (skill-level data with frequent zero intervals; low-volume specialty queues) makes MAPE unstable or unusable.

A second subtle issue: MAPE is asymmetric. Because the denominator is the actual, the penalty for under-prediction is capped at 100% (a forecast of zero), while the penalty for over-prediction is unbounded. Optimizing MAPE therefore biases forecasts toward under-prediction.
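The small-denominator explosion is easy to demonstrate. In this sketch (illustrative numbers), a six-call miss on a two-call interval contributes a 300% term that dominates the average:

```python
# Minimal sketch of MAPE's small-denominator problem (illustrative numbers).
def mape(actuals, forecasts):
    # Undefined when any actual is zero; unstable when actuals are small.
    return 100 / len(actuals) * sum(abs((a - f) / a) for a, f in zip(actuals, forecasts))

sparse   = [100, 100, 2, 100]   # one near-zero interval
forecast = [ 95, 105, 8,  95]   # a six-call miss on the tiny interval

print(mape(sparse, forecast))   # ~78.75%: one 300% term dominates the average
```

The three well-behaved intervals each contribute 5%; the tiny interval alone contributes 300%.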

Symmetric Mean Absolute Percentage Error (sMAPE)

$$\mathrm{sMAPE} = \frac{100}{n}\sum_{t=1}^{n} \frac{|e_t|}{(|y_t| + |\hat{y}_t|)/2}$$

A symmetric variant that addresses the asymmetry of MAPE. Bounded between 0% and 200%. Still unstable when both $y_t$ and $\hat{y}_t$ are near zero.

When to use: as a less-misleading alternative to MAPE on series that have small but non-zero values. Better than MAPE; still not robust on truly intermittent series.
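On a sparse series, sMAPE's per-interval contribution is capped at 200%, so a tiny-denominator interval cannot dominate without bound. A minimal sketch (illustrative numbers):

```python
# Minimal sMAPE sketch (illustrative numbers).
def smape(actuals, forecasts):
    # Each term is bounded by 200%, so no single interval explodes.
    n = len(actuals)
    return 100 / n * sum(abs(a - f) / ((abs(a) + abs(f)) / 2)
                         for a, f in zip(actuals, forecasts))

sparse   = [100, 100, 2, 100]   # one near-zero interval
forecast = [ 95, 105, 8,  95]

print(smape(sparse, forecast))  # ~33.8%: inflated by the tiny interval, but bounded
```

Note the remaining failure mode: an interval where both actual and forecast are zero still divides by zero.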

Mean Absolute Scaled Error (MASE)

$$\mathrm{MASE} = \frac{\frac{1}{n}\sum_{t=1}^{n} |e_t|}{\frac{1}{T-m}\sum_{t=m+1}^{T} |y_t - y_{t-m}|}$$

The Hyndman-recommended metric. The denominator is the in-sample MAE of the seasonal naive forecast on the training data (of length $T$); $m$ is the seasonal period.

A MASE of 1.0 means the forecast performs exactly as well as in-sample seasonal naive. MASE less than 1.0 means the forecast beats naive. MASE greater than 1.0 means the forecast is worse than naive — which usually means the method is broken.

When to use: the default for cross-series comparison. The natural metric for benchmarking against the seasonal naive baseline at scale. Robust on intermittent demand because the denominator is non-zero as long as the training series has any variation at the seasonal lag.

Why Hyndman recommends it: MASE is scale-free, comparable across series, defined for intermittent data, and grounded in the benchmark imperative (every method must beat naive).
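A minimal MASE sketch in plain Python. The training series, the seasonal period m = 4, and all values are illustrative:

```python
# Minimal MASE sketch (illustrative toy data, seasonal period m = 4).
def mase(actuals, forecasts, training, m):
    # Numerator: MAE of the forecast on the held-out data.
    mae_fc = sum(abs(a - f) for a, f in zip(actuals, forecasts)) / len(actuals)
    # Denominator: in-sample MAE of the seasonal naive forecast (lag m)
    # on the training data.
    scale = sum(abs(training[t] - training[t - m])
                for t in range(m, len(training))) / (len(training) - m)
    return mae_fc / scale

training  = [10, 20, 30, 40, 12, 22, 28, 42]  # two seasonal cycles
actuals   = [11, 21, 29, 41]
forecasts = [12, 23, 30, 40]

print(mase(actuals, forecasts, training, m=4))  # 0.625: beats seasonal naive
```

A result below 1.0 reads directly as "beats in-sample seasonal naive", which is what makes the metric self-benchmarking.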

Mean Error / Forecast Bias

$$\mathrm{ME} = \frac{1}{n}\sum_{t=1}^{n} e_t$$

Not an accuracy metric in the usual sense — ME measures the bias of the forecast. Positive ME means the forecast systematically under-predicts (actuals exceed forecasts on average); negative ME means systematic over-prediction.

When to use: diagnostic alongside any accuracy metric. A forecast can have low MAE and large bias, indicating that errors are small individually but skewed in one direction. Bias has different operational consequences than variance: an unbiased forecast missing by 8% in either direction is different from a biased forecast that systematically misses by 8% on the low side.
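The low-MAE, high-bias case is easy to construct. In this sketch (illustrative numbers), both forecasts miss by 8 calls on every interval, so their MAE is identical, but only one is biased:

```python
# Minimal bias sketch (illustrative numbers).
def mean_error(actuals, forecasts):
    # Positive ME = systematic under-prediction; negative = over-prediction.
    return sum(a - f for a, f in zip(actuals, forecasts)) / len(actuals)

actuals  = [100, 100, 100, 100]
unbiased = [ 92, 108,  92, 108]  # misses by 8 in both directions
low      = [ 92,  92,  92,  92]  # always 8 calls low

print(mean_error(actuals, unbiased))  # 0.0: errors cancel
print(mean_error(actuals, low))       # 8.0: systematic under-prediction
```

MAE alone cannot tell these two apart; ME separates them immediately.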

Choosing a metric

A practical decision matrix:

| Use case | Recommended metric | Reasoning |
|---|---|---|
| General-purpose, single series, well-behaved data | MAE or RMSE | Easy to interpret in units of the data |
| Capacity planning, large errors costly | RMSE | Penalizes the operationally expensive errors |
| Communicating accuracy to non-technical audiences | MAPE (with caveats) | Percentage is intuitive |
| Series with intermittent or zero values | MASE | MAPE is unusable on these series |
| Cross-series comparison (multiple skills, multiple sites) | MASE | Scale-free; comparable |
| Detecting systematic over- or under-prediction | ME (alongside MAE/RMSE) | Bias is separate from variance |
| Method benchmarking against seasonal naive | MASE | Built-in benchmark interpretation |

Time series cross-validation

Computing accuracy on a single train-test split is not enough. The held-out period may be unrepresentative — a holiday, a launch, an outage. Time series cross-validation (also called "rolling origin" or "expanding window") repeats the train-test split across multiple origins:

  1. Train on data through period $T_1$; forecast $h$ periods ahead; record errors
  2. Train on data through period $T_1 + 1$; forecast $h$ periods ahead; record errors
  3. Continue until the data runs out
  4. Aggregate errors across all origins for the final accuracy metric

This produces a more reliable estimate of forecast accuracy than a single split, particularly when the time series has changing characteristics.

For long horizons, vary $h$ as well: a method that wins at $h=1$ may lose at $h=12$. Report accuracy by horizon, not just an aggregate number.
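The rolling-origin loop above can be sketched in a few lines of plain Python. The fit_forecast callable and the seasonal-naive example method are illustrative placeholders, not any specific library's API:

```python
# Minimal rolling-origin (expanding window) cross-validation sketch.
def rolling_origin_errors(series, h, min_train, fit_forecast):
    # For each origin, train on series[:origin], forecast h steps ahead,
    # and record absolute errors grouped by horizon step.
    errors = {step: [] for step in range(1, h + 1)}
    for origin in range(min_train, len(series) - h + 1):
        fc = fit_forecast(series[:origin], h)  # returns a list of h forecasts
        for step in range(1, h + 1):
            errors[step].append(abs(series[origin + step - 1] - fc[step - 1]))
    return errors

def snaive(train, h, m=4):
    # Illustrative method: seasonal naive with period m.
    return [train[len(train) - m + (i % m)] for i in range(h)]

series = [10, 20, 30, 40] * 4  # perfectly seasonal toy data, m = 4
by_horizon = rolling_origin_errors(series, h=2, min_train=8, fit_forecast=snaive)
for step, errs in by_horizon.items():
    print(step, sum(errs) / len(errs))  # 0.0 at both horizons: naive is exact here
```

Keeping errors grouped by horizon step supports the per-horizon reporting recommended above.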

Common WFM pitfalls

  • Optimizing MAPE on skill-level data with frequent zero intervals. MAPE explodes; the optimizer chases the explosions. Use MASE or WAPE (a weighted variant) instead.
  • Single-split evaluation. A method that "wins" on one held-out month may have been lucky. Use rolling-origin cross-validation.
  • Reporting only one metric. Bias (ME) and variance (MAE, RMSE) are separate phenomena. Report both. A low-MAE high-bias forecast and a low-MAE low-bias forecast have different operational implications.
  • Comparing methods across different held-out windows. Method A on Q3 and Method B on Q4 are not comparable. Same data, same metric, same horizon.
  • Comparing across different series without scaling. MAPE comparisons across skills with very different volumes hide which skill has the worst forecast quality. Use MASE.
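WAPE, mentioned in the first pitfall as a weighted variant, divides total absolute error by total actual volume, so individual zero intervals do not break it. A minimal sketch (illustrative numbers):

```python
# Minimal WAPE sketch (illustrative numbers).
def wape(actuals, forecasts):
    # Weighted APE: defined as long as the series has any volume overall,
    # even when individual intervals are zero.
    total_error = sum(abs(a - f) for a, f in zip(actuals, forecasts))
    return 100 * total_error / sum(abs(a) for a in actuals)

intermittent = [0, 5, 0, 3, 0, 12]  # skill-level series with zero intervals
forecast     = [1, 4, 0, 4, 1, 10]

print(wape(intermittent, forecast))  # 30.0: defined despite the zeros
```

Unlike MAPE, no per-interval division occurs, so zero-volume intervals contribute error without producing undefined terms.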

Forecast accuracy in WFM software

Most WFM platforms report MAPE by default — sometimes MAPE is the only available metric. This is a usability decision (MAPE is easy to explain) but a methodological liability for skill-level reporting.

When evaluating WFM platform forecasts:

  • If only MAPE is reported, request raw forecast and actual data to compute MASE independently
  • If the platform's MAPE looks great but the forecast still feels off, the platform may be biased — request ME (or compute it from the raw data)
  • Beware platforms that report MAPE without specifying which periods are included; small denominators get filtered out, biasing the reported number

References

  • Hyndman, R. J., & Athanasopoulos, G. "Evaluating forecast accuracy." Forecasting: Principles and Practice (Python edition). otexts.com/fpppy.
  • Hyndman, R. J., & Koehler, A. B. "Another look at measures of forecast accuracy." International Journal of Forecasting 22(4), 2006. The paper introducing MASE.
