ML vs Classical Forecasting Comparison
ML vs Classical Forecasting Comparison provides a decision framework for choosing between traditional statistical methods (exponential smoothing, ARIMA) and machine learning methods (gradient boosting, neural networks) for workforce management demand forecasting. The answer is not "ML is better" or "classical is better" — it depends on data characteristics, organizational capability, and the specific forecasting problem.
The M-competition series — the largest and most rigorous forecasting competitions in the field — provide the empirical foundation for this comparison.
The M-competition evidence
The Makridakis competitions (M-competitions) are open forecasting tournaments that benchmark methods on thousands of real time series. They are the closest thing the forecasting field has to a controlled experiment at scale.
M4 Competition (2018)
100,000 time series across multiple frequencies (yearly, quarterly, monthly, weekly, daily, hourly). Key findings:
- Statistical-ML hybrids won. The winning method (Smyl's hybrid) combined exponential smoothing for local structure with a recurrent neural network for cross-series learning. Pure ML methods did not dominate pure statistical methods at the individual series level.[1]
- Simple methods remained competitive. A well-tuned ETS model or Theta method beat most ML submissions on many series. The median ML submission did not beat the median statistical submission.
- Combination worked. Simple averages of multiple methods outperformed most individual methods. The forecasting literature has known this since the 1960s, but practitioners routinely ignore it.
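The combination finding can be illustrated with a minimal sketch of an equal-weight average. The three member forecasters (seasonal naive, historical mean, drift) are deliberately simple stand-ins chosen for illustration, not the methods entered in M4, and the function names and toy series are hypothetical.

```python
from statistics import mean

def seasonal_naive(history, horizon, season=7):
    """Repeat the last full season of observations."""
    last_season = history[-season:]
    return [last_season[h % season] for h in range(horizon)]

def historical_mean(history, horizon):
    """Forecast every step with the mean of the history."""
    return [mean(history)] * horizon

def drift(history, horizon):
    """Extend the straight line from the first to the last observation."""
    slope = (history[-1] - history[0]) / (len(history) - 1)
    return [history[-1] + slope * (h + 1) for h in range(horizon)]

def combine(forecasts):
    """Equal-weight average of several forecasts, step by step."""
    return [mean(step) for step in zip(*forecasts)]

# Toy daily call volumes: two weeks with a weekly pattern (illustrative data).
history = [100, 120, 130, 125, 110, 60, 50,
           104, 124, 134, 129, 114, 64, 54]
horizon = 7
members = [
    seasonal_naive(history, horizon),
    historical_mean(history, horizon),
    drift(history, horizon),
]
combined = combine(members)
```

Equal weights are the usual starting point because they need no estimation; whether the average beats its best member on a given queue is something to check on holdout data.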
M5 Competition (2020)
M5 changed the problem setting: 42,840 hierarchical sales series from Walmart with rich external features (prices, promotions, events, calendar effects). Key findings:
- ML methods dominated. The top 50 submissions were overwhelmingly gradient boosting (LightGBM, XGBoost). For the first time in M-competition history, ML decisively outperformed classical methods.[2]
- Why ML won here: M5 had many related series sharing structure (hierarchical products in the same stores), rich external features (price changes, SNAP events), and abundant data. These are exactly the conditions where ML excels.
- Classical methods struggled because the problem required learning cross-series patterns and incorporating many features — neither of which ETS or ARIMA handles natively.
The M-competition lesson for WFM
The M4 and M5 results together tell a clear story: the advantage of ML over classical methods depends on the structure of the problem.
- M4-like problems (single or few series, no external features, moderate data) → classical methods are competitive or superior
- M5-like problems (many related series, rich features, lots of data) → ML methods dominate
WFM forecasting problems span both types. A single-skill call volume forecast with no external regressors is an M4 problem. A multi-skill, multi-site forecast incorporating marketing events, weather, and cross-skill cannibalization patterns is an M5 problem.
Decision framework
Data volume
| Data Volume | Classical | ML |
|---|---|---|
| < 100 observations | Strong — ETS/ARIMA designed for small samples | Weak — insufficient data for reliable training |
| 100–1,000 observations | Strong | Moderate — viable with regularization |
| 1,000–10,000 observations | Strong | Strong — enough data to learn patterns |
| 10,000+ observations | Adequate but does not exploit scale | Very strong: additional data generally keeps improving the model |
WFM context: a single queue with 2 years of daily data has ~730 observations. That is comfortable for classical methods and marginal for ML. The same queue at 15-minute intervals has ~35,000 observations if it is open 12 hours a day (48 intervals per day) and ~70,000 if it runs 24/7 (96 intervals per day). Either count is firmly in ML territory if the model can handle the autocorrelation structure.
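The observation counts are simple arithmetic. The sketch below assumes a 365-day year and two illustrative operating patterns (a 12-hour day and 24/7 operation); the helper name `observations` is hypothetical. A figure of ~35,000 quarter-hour observations over 2 years corresponds to the 12-hour case.

```python
def observations(years, intervals_per_day, days_per_year=365):
    """Number of data points for one queue at a given granularity."""
    return years * days_per_year * intervals_per_day

daily = observations(years=2, intervals_per_day=1)
quarter_hour_12h = observations(years=2, intervals_per_day=48)   # 12 h x 4 per hour
quarter_hour_24x7 = observations(years=2, intervals_per_day=96)  # 24 h x 4 per hour

print(daily, quarter_hour_12h, quarter_hour_24x7)  # 730 35040 70080
```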
Number of series
Classical methods are designed for one series at a time. Each series gets its own model with its own parameters. This is fine for 5–10 skills but becomes unwieldy at 100+ skills across multiple sites.
ML methods can train a single global model across all series, learning shared patterns (day-of-week effects, holiday impacts) while allowing series-specific behavior through features (skill ID, site, channel). This is the "global model" or "cross-learning" approach that dominated M5.[3]
At scale (50+ series), a single well-tuned LightGBM model will typically match or beat 50 individually tuned ETS models, and it leaves one model to retrain and monitor rather than 50, although its feature pipeline needs upkeep of its own.
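A global model starts from a pooled training table. The following is a minimal sketch of that data layout only, under assumed feature choices (series ID, day of week, a 7-day lag); the queue names, the helper `build_global_rows`, and the toy data are hypothetical, and the model fit itself is omitted so the sketch has no third-party dependency.

```python
from datetime import date, timedelta

def build_global_rows(series_by_id, start, lag=7):
    """Stack several daily series into one long table of training rows.

    Each row carries the series identifier as a feature, so one model can
    learn shared calendar effects while still telling the series apart.
    """
    rows = []
    for series_id, values in series_by_id.items():
        for t in range(lag, len(values)):
            day = start + timedelta(days=t)
            rows.append({
                "series_id": series_id,         # lets the model specialise per queue
                "day_of_week": day.weekday(),   # shared calendar feature (0 = Monday)
                f"lag_{lag}": values[t - lag],  # same weekday one week earlier
                "target": values[t],
            })
    return rows

# Hypothetical queues with three weeks of daily volume each.
series_by_id = {
    "billing": [100 + (d % 7) * 5 for d in range(21)],
    "tech_support": [300 + (d % 7) * 12 for d in range(21)],
}
rows = build_global_rows(series_by_id, start=date(2024, 1, 1))
```

In practice the rows would be handed to a gradient boosting library with `series_id` treated as a categorical feature, so that all queues share one set of trees.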
External variables
Classical: ARIMAX and dynamic regression can incorporate external regressors, but the relationship must be approximately linear (or transformed to be). Handling 20+ regressors in a classical framework is cumbersome. See External Regressors in WFM Forecasting for the full treatment.
ML: gradient boosting naturally handles dozens of features, captures non-linear relationships and interactions, and does not require the forecaster to specify the functional form. If the problem has rich features, ML has a structural advantage.
Computational budget
Classical: ETS and ARIMA fit in seconds on a single core. A forecast run for 100 series completes in under a minute. No GPU required.
ML (gradient boosting): LightGBM and XGBoost are fast. Training on 100K rows with 50 features takes seconds to minutes. Comparable to classical in practice.
ML (neural networks): Transformers and recurrent networks require significant compute. Training can take hours on GPU hardware. Inference is fast once trained, but the training loop is expensive. For most WFM applications, gradient boosting provides the ML benefits without the neural network compute cost.
Interpretability
Classical: ETS parameters (level, trend, seasonal components) have direct interpretable meaning. "The trend is +12 calls per week" is a statement operations managers understand. ARIMA coefficients are less intuitive but the model structure (differencing, seasonal period) is transparent.
ML: gradient boosting models are black boxes at the individual prediction level. SHAP values and feature importance provide post-hoc explanations, but they are approximations. Neural networks are even less interpretable.
For WFM, interpretability matters when:
- The forecast must be explained to business stakeholders ("why does the model say 5,200 calls next Tuesday?")
- The forecaster needs to diagnose forecast failures ("the model missed by 20% — why?")
- Regulatory or governance requirements mandate explainable models (rare in WFM but common in adjacent domains like financial forecasting)
Head-to-head comparison
| Criterion | Classical (ETS/ARIMA) | ML (Gradient Boosting) | Neural Networks |
|---|---|---|---|
| Accuracy on small data (< 500 obs) | Winner | Competitive with tuning | Weak |
| Accuracy on large data (10K+ obs) | Competitive | Winner | Strong with sufficient compute |
| Multi-series learning | Not native (each series independent) | Winner — global models | Strong — architectures like N-BEATS, TFT designed for this |
| External regressor handling | Limited (ARIMAX) | Winner — native feature support | Strong |
| Implementation effort | Low — well-documented, mature libraries | Medium — feature engineering required | High — architecture selection, hyperparameter tuning, GPU infra |
| Maintenance burden | Low — refit periodically | Medium — feature pipelines, data dependencies | High — model drift, retraining pipelines, infra ops |
| Interpretability | Winner — transparent parameters | Moderate — SHAP/feature importance | Weak — post-hoc only |
| Forecast horizon flexibility | Winner: multi-step forecasts come natively from the fitted model | Requires recursive or direct strategy | Architecture-dependent |
| Time to first forecast | Hours | Days | Weeks |
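The "recursive or direct strategy" entry in the table refers to two ways of getting multi-step forecasts out of a model that predicts a single target. A minimal sketch follows, using a one-regressor least-squares line as a stand-in for the ML model; the stand-in, the function names, and the toy data are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with a single regressor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def recursive_forecast(history, horizon):
    """Fit one one-step-ahead model and feed its own predictions back in."""
    a, b = fit_line(history[:-1], history[1:])
    forecasts, last = [], history[-1]
    for _ in range(horizon):
        last = a + b * last
        forecasts.append(last)
    return forecasts

def direct_forecast(history, horizon):
    """Fit a separate model for each step ahead, always from observed data."""
    forecasts = []
    for h in range(1, horizon + 1):
        a, b = fit_line(history[:-h], history[h:])
        forecasts.append(a + b * history[-1])
    return forecasts

history = [200, 212, 221, 235, 241, 256, 262, 275, 284, 291]
recursive = recursive_forecast(history, horizon=3)
direct = direct_forecast(history, horizon=3)
```

The two strategies agree at the first step and can diverge afterwards: the recursive version compounds its own errors, while the direct version needs one model per horizon.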
The hybrid approach
The M4 winner was a hybrid. The concept: use classical methods for what they do well (capturing local structure — trend, seasonality, level) and ML for what it does well (learning cross-series patterns and residual structure).
Architecture
- Classical base model: fit ETS or seasonal ARIMA to each series individually. Extract the fitted values and residuals.
- ML residual model: train a gradient boosting model on the residuals, using cross-series features (skill type, site, day-of-week, external regressors). The ML model learns the patterns that the classical model missed.
- Combined forecast: classical forecast + ML residual prediction.
Why this works
- The classical model handles the dominant signal (trend, seasonality) with minimal data and no feature engineering
- The ML model focuses on the residual — a stationary, lower-variance signal that is easier to learn
- The combination is more robust than either component alone — if the ML model fails or overfits, the classical base remains
- The residual model can use features that the classical model cannot: external regressors, cross-series correlations, categorical variables
When to use
The hybrid approach is warranted when:
- Individual series have clear seasonal structure (classical captures it well)
- There are cross-series patterns or external features the classical model cannot exploit
- The organization wants the interpretability of a classical base with the accuracy gains of ML
- Robustness matters more than maximum accuracy — the hybrid degrades gracefully
Implementation
In Python, this is straightforward with statsmodels (or statsforecast) for the classical component and LightGBM for the residual model. The pipeline:
- Fit ETS to each series using statsforecast's AutoETS
- Extract residuals
- Build a feature matrix: series ID, time features, external regressors, plus the ETS forecast itself as a feature
- Train LightGBM on residuals
- At forecast time: generate ETS forecast + LightGBM residual prediction, sum them
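The pipeline steps above can be sketched in plain Python. To keep the example dependency-free, it substitutes simple exponential smoothing with a fixed `alpha` for AutoETS and a pooled day-of-week residual average for LightGBM; both substitutions, the function names, and the toy queues are assumptions for illustration, not the tooling named above.

```python
from collections import defaultdict

def ses_fit(values, alpha=0.3):
    """Simple exponential smoothing: one-step-ahead fitted values and final level."""
    level = values[0]
    fitted = []
    for y in values:
        fitted.append(level)                  # forecast made before seeing y
        level = alpha * y + (1 - alpha) * level
    return fitted, level

def train_residual_model(series_by_id):
    """Pool residuals across all series and average them by day of week."""
    by_dow = defaultdict(list)
    levels = {}
    for series_id, values in series_by_id.items():
        fitted, levels[series_id] = ses_fit(values)
        for t, (y, f) in enumerate(zip(values, fitted)):
            if t == 0:
                continue                      # first fitted value is just the seed
            by_dow[t % 7].append(y - f)       # assumes every series starts on the same weekday
    residual_by_dow = {d: sum(r) / len(r) for d, r in by_dow.items()}
    return levels, residual_by_dow

def hybrid_forecast(series_id, n_obs, horizon, levels, residual_by_dow):
    """Classical forecast plus the learned residual correction."""
    base = levels[series_id]                  # SES forecasts are flat
    return [base + residual_by_dow[(n_obs + h) % 7] for h in range(horizon)]

# Two hypothetical queues sharing a weekly pattern, four weeks each.
weekly = [20, 10, 0, -5, 5, -15, -15]
series_by_id = {
    "billing": [100 + weekly[t % 7] for t in range(28)],
    "sales": [150 + weekly[t % 7] for t in range(28)],
}
levels, residual_by_dow = train_residual_model(series_by_id)
forecast = hybrid_forecast("billing", n_obs=28, horizon=7,
                           levels=levels, residual_by_dow=residual_by_dow)
```

The structure is what carries over to a real implementation: per-series classical fit, residuals pooled across series, a residual model keyed on shared features, and a final sum. Pooling raw residuals as done here only makes sense for queues of similar scale; a real residual model would normalise them or use the base forecast as a feature.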
The R ecosystem offers similar tooling with the fable package for classical methods and tidymodels for ML.[4]
When ML wins
ML provides a clear advantage when:
- Many related series: 50+ skills/queues that share day-of-week, holiday, and event patterns. A global LightGBM model exploits this shared structure; 50 independent ETS models do not.
- Rich feature space: marketing events, weather, pricing changes, app version, and other external variables that drive demand beyond what historical patterns capture.
- Abundant data: years of high-frequency data (15-minute intervals) providing tens of thousands of training rows.
- Non-linear effects: interactions between features that linear regression cannot capture — e.g., a marketing campaign during holiday season has a different effect than the sum of the campaign effect and the holiday effect.
- Cold-start series: a new queue with no history can be forecasted by a global model using features from similar existing queues. Classical methods require per-series history.
When classical wins
Classical methods are the right choice when:
- Limited data: a new queue with 3–6 months of daily data (roughly 90–180 observations). ETS and ARIMA are designed for exactly this scenario, while an ML model trained on so few rows is prone to overfitting.
- Single series: one queue to forecast, no cross-series learning opportunity. The overhead of ML feature engineering and pipeline maintenance is not justified.
- Interpretability required: the forecast must be explained to non-technical stakeholders or pass governance review. ETS decomposition ("here is the trend, here is the seasonal pattern, here is the forecast") is universally understandable.
- Minimal features: no external regressors, no cross-series structure. The problem is purely univariate temporal pattern extraction — classical methods' core competency.
- Operational simplicity: a small WFM team without data engineering resources cannot maintain ML pipelines, feature stores, and model retraining infrastructure. A monthly ETS refit is operationally sustainable.
The WFM practitioner's decision tree
- How many series? If < 10 with no shared structure, start classical. If 50+, evaluate ML.
- External features available? If rich features exist (marketing, weather, events), ML has an advantage. If purely historical data, classical is sufficient.
- Data volume per series? If < 500 observations, classical. If 5,000+, ML is viable.
- Organizational capability? If the WFM team has data engineering and ML expertise (or access to it), ML is an option. If not, classical with good process discipline will outperform poorly maintained ML.
- Start with the baseline. Run seasonal naive and well-tuned ETS. If a method cannot beat these, the problem is process, not methodology. Only then evaluate whether ML adds value on top.[5]
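The checklist can be written down as a first-pass rule. The thresholds (10 and 50 series, 500 and 5,000 observations) come from the list above; the order of the checks, the handling of in-between cases, and the function and parameter names are assumptions of this sketch, not part of the framework itself.

```python
def recommend_method(n_series, obs_per_series, has_rich_features, has_ml_capability):
    """Encode the checklist as a first-pass recommendation (a sketch, not a rule)."""
    if not has_ml_capability:
        return "classical"                    # no team to maintain an ML pipeline
    if obs_per_series < 500:
        return "classical"                    # too little data per series
    if n_series >= 50 or (has_rich_features and obs_per_series >= 5000):
        return "evaluate ML against a classical baseline"
    if n_series < 10 and not has_rich_features:
        return "classical"
    return "classical baseline first, then test ML or a hybrid"

single_queue = recommend_method(5, 730, False, True)
multi_site = recommend_method(120, 35000, True, True)
no_ml_team = recommend_method(120, 35000, True, False)
in_between = recommend_method(20, 2000, True, True)
```

Whatever the function returns, the last step of the checklist still applies: the recommendation is only a starting point until it has beaten seasonal naive and ETS on holdout data.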
Common mistakes
- Deploying ML without a classical baseline. If you cannot show that ML beats ETS on your data, you are paying for complexity without benefit.
- Over-engineering features. Adding 200 features to a LightGBM model does not guarantee better accuracy. Start with the obvious features (time, day of week, lag values) and add complexity incrementally, measuring improvement at each step.
- Ignoring the maintenance cost. ML models degrade over time as data distributions shift. A model trained on 2023 data may underperform by mid-2024. Classical models degrade too, but their simplicity makes diagnosis and refit easier.
- Confusing training accuracy with production accuracy. ML models can achieve near-zero error on training data through overfitting. Only holdout (out-of-time) accuracy matters. Cross-validation with temporal splits (never future data leaking into training) is mandatory.
- Assuming neural networks are always better than gradient boosting. For tabular WFM data, gradient boosting (LightGBM, XGBoost) consistently matches or beats neural networks. The transformer revolution in NLP and vision has not (yet) translated to dominance on tabular forecasting problems.[6]
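Two of the points above, the classical baseline and temporal cross-validation, can be combined in one evaluation loop. This is a minimal sketch of rolling-origin evaluation with a seasonal naive baseline; the window sizes, function names, and toy series are illustrative assumptions.

```python
def rolling_origin_splits(n_obs, initial_train, horizon, step):
    """Yield (train_end, test_end) index pairs with training always before test."""
    train_end = initial_train
    while train_end + horizon <= n_obs:
        yield train_end, train_end + horizon
        train_end += step

def seasonal_naive(train, horizon, season=7):
    """Baseline: repeat the most recent full season."""
    return [train[len(train) - season + (h % season)] for h in range(horizon)]

def mean_absolute_error(actual, forecast):
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def evaluate(series, forecaster, initial_train=28, horizon=7, step=7):
    """Average out-of-time error of `forecaster` over all rolling origins."""
    errors = []
    for train_end, test_end in rolling_origin_splits(len(series), initial_train, horizon, step):
        train, test = series[:train_end], series[train_end:test_end]
        errors.append(mean_absolute_error(test, forecaster(train, horizon)))
    return sum(errors) / len(errors)

# Toy series: eight weeks with a weekly pattern and a slow upward trend.
weekly = [20, 10, 0, -5, 5, -15, -15]
series = [100 + weekly[t % 7] + t // 7 for t in range(56)]
baseline_mae = evaluate(series, seasonal_naive)
```

Any candidate model with the same `(train, horizon)` signature can be passed to `evaluate` and compared against `baseline_mae`. Because each training window ends before its test window begins, no future data leaks into training.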
Relationship to other pages
- Forecasting Methods — parent page covering the full method taxonomy
- Exponential Smoothing — the dominant classical method family
- ARIMA Models — the second classical pillar
- External Regressors in WFM Forecasting — the feature space that gives ML its advantage
- Forecast Combination — ensembling methods that blend classical and ML outputs
- Forecast Accuracy Metrics — the metrics for comparing methods (MASE is preferred for cross-method comparison)
References
- [1] Makridakis, S., Spiliotis, E., and Assimakopoulos, V. (2018). The M4 Competition: Results, Findings, Conclusion and Way Forward. International Journal of Forecasting, 34(4), 802–808.
- [2] Makridakis, S., Spiliotis, E., and Assimakopoulos, V. (2022). M5 Accuracy Competition: Results, Findings, and Conclusions. International Journal of Forecasting, 38(4), 1346–1364.
- [3] Januschowski, T., Gasthaus, J., Wang, Y., Salinas, D., Flunkert, V., Bohlke-Schneider, M., and Callot, L. (2020). Criteria for Classifying Forecasting Methods. International Journal of Forecasting, 36(1), 167–177.
- [4] Hyndman, R.J. and Athanasopoulos, G. (2021). Forecasting: Principles and Practice. 3rd ed. OTexts. Chapter 12: Advanced Forecasting Methods.
- [5] Petropoulos, F., Apiletti, D., Assimakopoulos, V., et al. (2022). Forecasting: Theory and Practice. International Journal of Forecasting, 38(3), 705–871.
- [6] Shwartz-Ziv, R. and Armon, A. (2022). Tabular Data: Deep Learning is Not All You Need. Information Fusion, 81, 84–90.
