ML vs Classical Forecasting Comparison

This page provides a decision framework for choosing between traditional statistical methods (exponential smoothing, ARIMA) and machine learning methods (gradient boosting, neural networks) for workforce management (WFM) demand forecasting. Neither family is better across the board: the right choice depends on data characteristics, organizational capability, and the specific forecasting problem.

The M-competition series, among the largest and most widely cited forecasting competitions in the field, provides the empirical foundation for this comparison.

The M-competition evidence

The Makridakis competitions (M-competitions) are open forecasting tournaments that benchmark methods on thousands of real time series. They are the closest thing the forecasting field has to a controlled experiment at scale.

M4 Competition (2018)

100,000 time series across multiple frequencies (yearly, quarterly, monthly, weekly, daily, hourly). Key findings:

Statistical-ML hybrids won. The winning method (Smyl's hybrid) combined exponential smoothing for local structure with a recurrent neural network for cross-series learning. Pure ML methods did not dominate pure statistical methods at the individual series level.^[1]
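Simple methods remained competitive. Standard ETS and Theta benchmarks beat most pure ML submissions on many series. None of the six pure ML entries was more accurate than the competition's Comb benchmark (a simple average of three exponential smoothing variants), and only one beat the Naïve 2 benchmark.^[1]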
Combination worked. Simple averages of multiple methods outperformed most individual methods. The forecasting literature has known this since the 1960s, but practitioners routinely ignore it.
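
The combination finding above can be illustrated with an equal-weight average. This is a minimal sketch: the three forecasts and the actuals are made-up numbers chosen so that the individual errors partly offset, and the function names are illustrative rather than taken from any library.

```python
def combine(forecasts):
    """Equal-weight average of several forecast paths of the same length."""
    return [sum(step) / len(step) for step in zip(*forecasts)]

def mae(actual, forecast):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

# Hypothetical daily call volumes and three method forecasts (illustrative only)
actual = [500, 520, 480, 510]
forecast_a = [530, 540, 500, 530]  # tends to run high
forecast_b = [480, 495, 470, 490]  # tends to run low
forecast_c = [505, 530, 470, 515]  # mixed errors

combined = combine([forecast_a, forecast_b, forecast_c])
individual_errors = [mae(actual, f) for f in (forecast_a, forecast_b, forecast_c)]
combined_error = mae(actual, combined)
```

In this constructed case the average beats every individual forecast because the errors point in different directions. On real data the combination is not guaranteed to beat the best single method, but it usually avoids the worst one.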

M5 Competition (2020)

M5 changed the problem setup: 42,840 hierarchical sales series from Walmart with rich external features (prices, promotions, events, calendar effects). Key findings:

ML methods dominated. The top 50 submissions were overwhelmingly gradient boosting (LightGBM, XGBoost). For the first time in M-competition history, ML decisively outperformed classical methods.^[2]
Why ML won here: M5 had many related series sharing structure (hierarchical products in the same stores), rich external features (price changes, SNAP events), and abundant data. These are exactly the conditions where ML excels.
Classical methods struggled because the problem required learning cross-series patterns and incorporating many features — neither of which ETS or ARIMA handles natively.

The M-competition lesson for WFM

The M4 and M5 results together tell a clear story: the advantage of ML over classical methods depends on the structure of the problem.

M4-like problems (single or few series, no external features, moderate data) → classical methods are competitive or superior
M5-like problems (many related series, rich features, lots of data) → ML methods dominate

WFM forecasting problems span both types. A single-skill call volume forecast with no external regressors is an M4 problem. A multi-skill, multi-site forecast incorporating marketing events, weather, and cross-skill cannibalization patterns is an M5 problem.

Decision framework

Data volume

Data Volume	Classical	ML
< 100 observations	Strong — ETS/ARIMA designed for small samples	Weak — insufficient data for reliable training
100–1,000 observations	Strong	Moderate — viable with regularization
1,000–10,000 observations	Strong	Strong — enough data to learn patterns
10,000+ observations	Adequate but does not exploit scale	Very strong: accuracy generally keeps improving as data grows

WFM context: a single queue with 2 years of daily data has ~730 observations. That is comfortable for classical methods and marginal for ML. The same queue at 15-minute intervals has ~70,000 observations if it runs 24/7 (730 days × 96 intervals per day), or ~35,000 with a 12-hour operating day. Either count is firmly in ML territory if the model can handle the autocorrelation structure.
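
The interval counts above follow from simple arithmetic. In this minimal sketch, the function name and the operating-hours parameter are illustrative assumptions. Note that a figure of ~35,000 corresponds to a 12-hour operating day, while round-the-clock operation gives roughly twice that.

```python
def observation_count(days, interval_minutes, open_hours_per_day=24):
    """Number of observations for one queue over a history window."""
    intervals_per_day = (open_hours_per_day * 60) // interval_minutes
    return days * intervals_per_day

two_years = 730  # days

daily = observation_count(two_years, interval_minutes=24 * 60)
quarter_hour_24x7 = observation_count(two_years, interval_minutes=15)
quarter_hour_12h = observation_count(two_years, interval_minutes=15, open_hours_per_day=12)
```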

Number of series

Classical methods are designed for one series at a time. Each series gets its own model with its own parameters. This is fine for 5–10 skills but becomes unwieldy at 100+ skills across multiple sites.

ML methods can train a single global model across all series, learning shared patterns (day-of-week effects, holiday impacts) while allowing series-specific behavior through features (skill ID, site, channel). This is the "global model" or "cross-learning" approach that dominated M5.^[3]

At scale (50+ series), a single well-tuned LightGBM model will typically match or beat 50 individually tuned ETS models — and requires a fraction of the maintenance.
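
A global model starts from a pooled training table in which every row carries its series identity alongside lag and calendar features. The sketch below builds such a table in plain Python. The function name, the lag choices, and the convention that index 0 falls on day-of-week 0 are assumptions for illustration. The rows could then be passed to any tabular learner, with `series_id` treated as a categorical feature.

```python
def build_global_rows(series_by_id, lags=(1, 7)):
    """Pool several daily series into one table with lag and calendar features."""
    rows = []
    max_lag = max(lags)
    for series_id, values in series_by_id.items():
        # Start at max_lag so every lag feature has a real historical value
        for t in range(max_lag, len(values)):
            row = {
                "series_id": series_id,  # lets one model learn series-specific behavior
                "day_of_week": t % 7,    # assumes index 0 falls on day-of-week 0
                "target": values[t],
            }
            for lag in lags:
                row[f"lag_{lag}"] = values[t - lag]
            rows.append(row)
    return rows

# Two hypothetical queues with ten days of history each
history = {
    "billing": list(range(100, 110)),
    "support": list(range(200, 210)),
}
rows = build_global_rows(history)
```

Each queue contributes three usable rows here, because the first seven days are consumed by the longest lag.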

External variables

Classical: ARIMAX and dynamic regression can incorporate external regressors, but the relationship must be approximately linear (or transformed to be). Handling 20+ regressors in a classical framework is cumbersome. See External Regressors in WFM Forecasting for the full treatment.

ML: gradient boosting naturally handles dozens of features, captures non-linear relationships and interactions, and does not require the forecaster to specify the functional form. If the problem has rich features, ML has a structural advantage.

Computational budget

Classical: ETS and ARIMA fit in seconds on a single core. A forecast run for 100 series completes in under a minute. No GPU required.

ML (gradient boosting): LightGBM and XGBoost are fast. Training on 100K rows with 50 features takes seconds to minutes. Comparable to classical in practice.

ML (neural networks): Transformers and recurrent networks require significant compute. Training can take hours on GPU hardware. Inference is fast once trained, but the training loop is expensive. For most WFM applications, gradient boosting provides the ML benefits without the neural network compute cost.

Interpretability

Classical: ETS components (level, trend, seasonal) have direct, interpretable meaning. "The trend is +12 calls per week" is a statement operations managers understand. ARIMA coefficients are less intuitive, but the model structure (differencing, seasonal period) is transparent.
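
To show how a trend statement can be read straight off a classical model, here is a minimal sketch of Holt's linear trend method (exponential smoothing with an additive trend and no seasonality). The smoothing weights are fixed by assumption rather than estimated, and the weekly series is synthetic.

```python
def holt_linear(values, alpha=0.5, beta=0.3):
    """Holt's linear trend method; returns the final (level, trend) states."""
    level = values[0]
    trend = values[1] - values[0]  # simple initialization from the first two points
    for y in values[1:]:
        previous_level = level
        level = alpha * y + (1 - alpha) * (previous_level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return level, trend

# Synthetic weekly call volumes growing by 12 calls per week
weekly_calls = [1000 + 12 * week for week in range(12)]

level, trend = holt_linear(weekly_calls)
summary = f"The trend is {trend:+.0f} calls per week"
forecast_4_weeks_ahead = level + 4 * trend
```

The trend state is expressed in the units of the series per time step, which is why it can be reported to stakeholders without further translation.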

ML: gradient boosting models are black boxes at the individual prediction level. SHAP values and feature importance provide post-hoc explanations, but they are approximations. Neural networks are even less interpretable.

For WFM, interpretability matters when:

The forecast must be explained to business stakeholders ("why does the model say 5,200 calls next Tuesday?")
The forecaster needs to diagnose forecast failures ("the model missed by 20% — why?")
Regulatory or governance requirements mandate explainable models (rare in WFM but common in adjacent domains like financial forecasting)

Head-to-head comparison

Criterion	Classical (ETS/ARIMA)	ML (Gradient Boosting)	Neural Networks
Accuracy on small data (< 500 obs)	Winner	Competitive with tuning	Weak
Accuracy on large data (10K+ obs)	Competitive	Winner	Strong with sufficient compute
Multi-series learning	Not native (each series independent)	Winner — global models	Strong — architectures like N-BEATS, TFT designed for this
External regressor handling	Limited (ARIMAX)	Winner — native feature support	Strong
Implementation effort	Low — well-documented, mature libraries	Medium — feature engineering required	High — architecture selection, hyperparameter tuning, GPU infra
Maintenance burden	Low — refit periodically	Medium — feature pipelines, data dependencies	High — model drift, retraining pipelines, infra ops
Interpretability	Winner — transparent parameters	Moderate — SHAP/feature importance	Weak — post-hoc only
Forecast horizon flexibility	Winner: multi-step forecasts come straight from the fitted model	Requires a recursive or direct multi-step strategy	Architecture-dependent
Time to first forecast	Hours	Days	Weeks

The hybrid approach

The M4 winner was a hybrid. The concept: use classical methods for what they do well (capturing local structure — trend, seasonality, level) and ML for what it does well (learning cross-series patterns and residual structure).

Architecture

Classical base model: fit ETS or seasonal ARIMA to each series individually. Extract the fitted values and residuals.
ML residual model: train a gradient boosting model on the residuals, using cross-series features (skill type, site, day-of-week, external regressors). The ML model learns the patterns that the classical model missed.
Combined forecast: classical forecast + ML residual prediction.

Why this works

The classical model handles the dominant signal (trend, seasonality) with minimal data and no feature engineering
The ML model focuses on the residual, which is typically closer to stationary and lower in variance than the raw series, and therefore easier to learn
The combination is more robust than either component alone — if the ML model fails or overfits, the classical base remains
The residual model can use features that the classical model cannot: external regressors, cross-series correlations, categorical variables

When to use

The hybrid approach is warranted when:

Individual series have clear seasonal structure (classical captures it well)
There are cross-series patterns or external features the classical model cannot exploit
The organization wants the interpretability of a classical base with the accuracy gains of ML
Robustness matters more than maximum accuracy — the hybrid degrades gracefully

Implementation

In Python, this is straightforward with statsmodels (or statsforecast) for the classical component and LightGBM for the residual model. The pipeline:

Fit ETS to each series using statsforecast's AutoETS
Extract residuals
Build a feature matrix: series ID, time features, external regressors, plus the ETS forecast itself as a feature
Train LightGBM on residuals
At forecast time: generate ETS forecast + LightGBM residual prediction, sum them
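
The steps above can be sketched end to end in plain Python. This is a toy under stated assumptions, not the statsforecast and LightGBM pipeline itself: simple exponential smoothing stands in for AutoETS, and a pooled mean residual per day-of-week slot stands in for the gradient boosting residual model. All function names and data are hypothetical.

```python
from collections import defaultdict

def ses_fit(values, alpha=0.3):
    """Simple exponential smoothing: one-step-ahead fitted values and final level."""
    level = values[0]
    fitted = []
    for y in values:
        fitted.append(level)  # forecast made before observing y
        level = alpha * y + (1 - alpha) * level
    return fitted, level

def fit_hybrid(series_by_id, alpha=0.3, period=7):
    """Fit a per-series base model, then a pooled residual model across series."""
    levels = {}
    residuals_by_slot = defaultdict(list)
    for series_id, values in series_by_id.items():
        fitted, levels[series_id] = ses_fit(values, alpha)
        for t, (y, f) in enumerate(zip(values, fitted)):
            if t == 0:
                continue  # the first fitted value equals the observation by construction
            residuals_by_slot[t % period].append(y - f)
    # Stand-in residual learner: mean residual per weekly slot, shared by all series
    residual_model = {slot: sum(r) / len(r) for slot, r in residuals_by_slot.items()}
    return levels, residual_model

def predict_hybrid(levels, residual_model, series_id, t, period=7):
    """Combined forecast: flat base forecast plus the predicted residual."""
    return levels[series_id] + residual_model.get(t % period, 0.0)

# Two hypothetical queues that share a +20 spike on the first day of each week
history = {
    "billing": [100 + (20 if t % 7 == 0 else 0) for t in range(28)],
    "support": [200 + (20 if t % 7 == 0 else 0) for t in range(28)],
}
levels, residual_model = fit_hybrid(history)
base_forecast = levels["billing"]
hybrid_forecast = predict_hybrid(levels, residual_model, "billing", t=28)
```

The base model alone forecasts a flat level and misses the weekly spike. The residual model recovers it from the pattern shared across both queues. In a real pipeline the residual learner would also take external regressors and series features as inputs.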

The R ecosystem offers similar tooling with the fable package for classical methods and tidymodels for ML.^[4]

When ML wins

ML provides a clear advantage when:

Many related series: 50+ skills/queues that share day-of-week, holiday, and event patterns. A global LightGBM model exploits this shared structure; 50 independent ETS models do not.
Rich feature space: marketing events, weather, pricing changes, app version, and other external variables that drive demand beyond what historical patterns capture.
Abundant data: years of high-frequency data (15-minute intervals) providing tens of thousands of training rows.
Non-linear effects: interactions between features that linear regression cannot capture — e.g., a marketing campaign during holiday season has a different effect than the sum of the campaign effect and the holiday effect.
Cold-start series: a new queue with no history can be forecasted by a global model using features from similar existing queues. Classical methods require per-series history.

When classical wins

Classical methods are the right choice when:

Limited data: a new queue with 3–6 months of daily data. ETS and ARIMA are designed for exactly this scenario. ML will overfit.
Single series: one queue to forecast, no cross-series learning opportunity. The overhead of ML feature engineering and pipeline maintenance is not justified.
Interpretability required: the forecast must be explained to non-technical stakeholders or pass governance review. ETS decomposition ("here is the trend, here is the seasonal pattern, here is the forecast") is universally understandable.
Minimal features: no external regressors, no cross-series structure. The problem is purely univariate temporal pattern extraction — classical methods' core competency.
Operational simplicity: a small WFM team without data engineering resources cannot maintain ML pipelines, feature stores, and model retraining infrastructure. A monthly ETS refit is operationally sustainable.

The WFM practitioner's decision tree

How many series? If < 10 with no shared structure, start classical. If 50+, evaluate ML.
External features available? If rich features exist (marketing, weather, events), ML has an advantage. If purely historical data, classical is sufficient.
Data volume per series? If < 500 observations, classical. If 5,000+, ML is viable.
Organizational capability? If the WFM team has data engineering and ML expertise (or access to it), ML is an option. If not, classical with good process discipline will outperform poorly maintained ML.
Start with the baseline. Run seasonal naive and a well-tuned ETS model. If a more complex method cannot beat these baselines, look first at data quality and process rather than at more sophisticated methodology. Only then evaluate whether ML adds value on top.^[5]
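
The questions above can be encoded as a small rule function. This is one possible reading: the order in which the rules are checked and the handling of the in-between cases (10 to 50 series, 500 to 5,000 observations) are assumptions, since the questions do not specify them. The function and argument names are hypothetical.

```python
def recommend_method(n_series, shared_structure, rich_features,
                     obs_per_series, ml_capability):
    """Map the decision-tree questions to a starting recommendation."""
    # No ML expertise or too little data per series: stay classical
    if not ml_capability or obs_per_series < 500:
        return "classical"
    # Few unrelated series with purely historical data: classical is sufficient
    if n_series < 10 and not shared_structure and not rich_features:
        return "classical"
    # Many series or rich features: ML is worth evaluating against the baseline
    if n_series >= 50 or rich_features:
        return "evaluate ML against the classical baseline"
    # In-between cases are not covered by the questions; default to the baseline first
    return "classical first, then test ML"

single_queue = recommend_method(1, False, False, 730, True)
multi_site = recommend_method(120, True, True, 35000, True)
no_ml_team = recommend_method(120, True, True, 35000, False)
```

Whatever the function returns, the last question still applies: the seasonal naive and ETS baselines are run first, and ML has to earn its place against them.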

Common mistakes

Deploying ML without a classical baseline. If you cannot show that ML beats ETS on your data, you are paying for complexity without benefit.
Over-engineering features. Adding 200 features to a LightGBM model does not guarantee better accuracy. Start with the obvious features (time, day of week, lag values) and add complexity incrementally, measuring improvement at each step.
Ignoring the maintenance cost. ML models degrade over time as data distributions shift. A model trained on 2023 data may underperform by mid-2024. Classical models degrade too, but their simplicity makes diagnosis and refit easier.
Confusing training accuracy with production accuracy. ML models can achieve near-zero error on training data through overfitting. Only holdout (out-of-time) accuracy matters. Cross-validation with temporal splits, in which no future data leaks into training, is mandatory.
Assuming neural networks are always better than gradient boosting. For tabular WFM data, gradient boosting (LightGBM, XGBoost) consistently matches or beats neural networks. The transformer revolution in NLP and vision has not (yet) translated to dominance on tabular forecasting problems.^[6]
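
The first and fourth mistakes above (no baseline, no out-of-time testing) can be avoided with a rolling-origin evaluation scored against seasonal naive. A minimal sketch, assuming daily data with a weekly period and an expanding training window: the function names and the synthetic series are illustrative, and MASE here scales the holdout error by the in-sample seasonal naive error.

```python
def rolling_origin_splits(n_obs, initial_train, horizon, step=None):
    """Yield (train_indices, test_indices) where training always precedes testing."""
    step = step or horizon
    origin = initial_train
    while origin + horizon <= n_obs:
        yield range(0, origin), range(origin, origin + horizon)
        origin += step

def seasonal_naive(train, horizon, period=7):
    """Repeat the last observed seasonal cycle over the forecast horizon."""
    return [train[len(train) - period + (h % period)] for h in range(horizon)]

def mase(train, actual, forecast, period=7):
    """Holdout MAE divided by the in-sample MAE of seasonal naive."""
    scale = sum(abs(train[t] - train[t - period])
                for t in range(period, len(train))) / (len(train) - period)
    error = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return error / scale

# Synthetic daily series: weekly pattern plus a steady upward trend
series = [100 + 10 * (t % 7) + t for t in range(56)]

baseline_scores = []
for train_idx, test_idx in rolling_origin_splits(len(series), initial_train=28, horizon=7):
    train = [series[i] for i in train_idx]
    actual = [series[i] for i in test_idx]
    baseline_scores.append(mase(train, actual, seasonal_naive(train, horizon=7)))
```

A candidate model is scored on the same splits. If its average MASE does not come in below the baseline's, the added complexity is not paying for itself.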

Relationship to other pages

Forecasting Methods — parent page covering the full method taxonomy
Exponential Smoothing — the dominant classical method family
ARIMA Models — the second classical pillar
External Regressors in WFM Forecasting — the feature space that gives ML its advantage
Forecast Combination — ensembling methods that blend classical and ML outputs
Forecast Accuracy Metrics — the metrics for comparing methods (MASE is preferred for cross-method comparison)

[1] Makridakis, S., Spiliotis, E., and Assimakopoulos, V. (2018). The M4 Competition: Results, Findings, Conclusion and Way Forward. International Journal of Forecasting, 34(4), 802–808.

[2] Makridakis, S., Spiliotis, E., and Assimakopoulos, V. (2022). M5 Accuracy Competition: Results, Findings, and Conclusions. International Journal of Forecasting, 38(4), 1346–1364.

[3] Januschowski, T., Gasthaus, J., Wang, Y., Salinas, D., Flunkert, V., Bohlke-Schneider, M., and Callot, L. (2020). Criteria for Classifying Forecasting Methods. International Journal of Forecasting, 36(1), 167–177.

[4] Hyndman, R.J. and Athanasopoulos, G. (2021). Forecasting: Principles and Practice. 3rd ed. OTexts. Chapter 12: Advanced Forecasting Methods.

[5] Petropoulos, F., Apiletti, D., Assimakopoulos, V., et al. (2022). Forecasting: Theory and Practice. International Journal of Forecasting, 38(3), 705–871.

[6] Shwartz-Ziv, R. and Armon, A. (2022). Tabular Data: Deep Learning is Not All You Need. Information Fusion, 81, 84–90.
