ML Pipeline Architecture for WFM

From WFM Labs

ML Pipeline Architecture for WFM describes how to build production machine learning systems for workforce management forecasting and optimization. Moving from a Jupyter notebook prototype to a reliable, monitored production system is where most WFM ML initiatives fail — not because the models are bad, but because the infrastructure around them does not exist.

The Maturity Spectrum

Figure: ML pipeline architecture, from feature store to monitoring
| Level | Description | Tools | Appropriate When |
|---|---|---|---|
| 0: Manual | Analyst runs notebook, copies forecast into WFM tool | Jupyter, Excel | Proof of concept, < 5 forecast series |
| 1: Scripted | Python script on cron, outputs to shared drive or database | Python + cron + CSV/SQL | Small team, < 20 series, weekly reforecast |
| 2: Orchestrated | Workflow orchestrator manages training, validation, serving | Airflow/Prefect + MLflow + SQL | Growing team, 20-100 series, daily reforecast |
| 3: Platform | Full MLOps with feature store, model registry, monitoring | Feast + MLflow + Airflow + Grafana | ML team, 100+ series, real-time serving |

Most WFM teams should target Level 1-2. Level 3 is justified only when ML models drive real-time intraday decisions or when model count exceeds what a team can manually track.

Architecture Components

Feature Store

A feature store centralizes feature computation so that training and serving use identical feature logic. When the two paths compute features separately, they drift apart (training/serving skew), which is a common source of silent model degradation.

Core WFM features to centralize:

| Feature Group | Features | Computation |
|---|---|---|
| Calendar | day_of_week, month, week_of_year, is_holiday, holiday_distance, is_month_end | Static calendar table + holiday API |
| Lag | volume_lag_1d, volume_lag_7d, volume_lag_28d, volume_lag_364d | Computed from fact table, requires point-in-time correctness |
| Rolling | rolling_mean_7d, rolling_std_7d, rolling_mean_28d | Window functions on fact table |
| External | marketing_spend, product_launch_flag, weather_temp, billing_cycle_day | External API ingestion pipeline |
| Derived | aht_trend_28d, shrinkage_rolling_14d, channel_mix_7d | Computed from multiple source tables |

Point-in-time correctness: When computing features for a historical training example at date T, you must only use data available as of date T. Using future data (even accidentally via a rolling window that includes T+1) creates leakage and inflated validation metrics.
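
The rule above can be illustrated with a minimal pandas sketch. It assumes a daily table with one row per date and no gaps (the column names `date` and `volume` and the helper name are illustrative, not part of any particular feature store). The key detail is the `shift(1)` before `rolling()`, which keeps day T out of its own window.

```python
import pandas as pd

def add_point_in_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add lag and rolling features that only use rows dated before each row."""
    out = df.sort_values("date").reset_index(drop=True).copy()
    # Lags look strictly backward: row T gets the volume observed at T-1 and T-7.
    # shift() is row-based, so this assumes exactly one row per calendar day.
    out["volume_lag_1d"] = out["volume"].shift(1)
    out["volume_lag_7d"] = out["volume"].shift(7)
    # Shifting before rolling excludes day T from its own 7-day window.
    out["rolling_mean_7d"] = out["volume"].shift(1).rolling(7).mean()
    return out

history = pd.DataFrame({
    "date": pd.date_range("2025-01-01", periods=10, freq="D"),
    "volume": [100, 110, 120, 130, 140, 150, 160, 170, 180, 190],
})
features = add_point_in_time_features(history)
```

For 2025-01-08 the rolling mean covers 2025-01-01 through 2025-01-07 only. Rows without seven full prior days are left as missing values rather than filled from a shorter window.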

Tool options:

  • Feast (open-source): Offline (batch) + online (real-time) feature serving. Good for teams already using Python.
  • dbt + SQL views: Simpler approach — define features as dbt models, materialize as tables. Sufficient for batch-only serving.
  • Custom SQL views: Minimum viable approach. Define feature queries in version-controlled SQL files.

Training Pipeline

The training pipeline runs on a schedule (or triggered by data arrival) and produces trained models stored in a model registry.

Pipeline steps:

  1. Data validation — Check for missing dates, outlier volumes, schema changes
  2. Feature computation — Run feature store transformations
  3. Training data assembly — Join features with target variable, apply time-based train/val/test split
  4. Model training — Train each base model (and ensemble if applicable)
  5. Model validation — Evaluate on holdout, compare against production model
  6. Model registration — Store model artifact, metadata, and metrics in registry
  7. Promotion decision — Auto-promote if new model beats production by threshold, otherwise flag for review
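
The promotion decision in step 7 might look like the sketch below. The function name, the use of validation MAPE as the comparison metric, and the 2% relative-improvement default are assumptions for illustration; the right threshold depends on how noisy the holdout metric is.

```python
def promotion_decision(candidate_mape: float, production_mape: float,
                       min_relative_gain: float = 0.02) -> str:
    """Return 'promote' when the candidate beats production by the threshold,
    otherwise 'review' so a person looks at it."""
    if production_mape <= 0:
        # A zero or negative production MAPE is suspicious; never auto-promote.
        return "review"
    relative_gain = (production_mape - candidate_mape) / production_mape
    return "promote" if relative_gain >= min_relative_gain else "review"

decision = promotion_decision(candidate_mape=9.0, production_mape=10.0)
```

Requiring a minimum gain, rather than any improvement at all, avoids swapping models on differences that are within the noise of the validation window.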

Airflow DAG structure:

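from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# run_data_validation, run_feature_pipeline, run_training, run_model_validation,
# and register_if_improved are project-specific callables, imported from the
# team's own pipeline module (not shown here).

default_args = {
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
    'execution_timeout': timedelta(hours=2),
}

with DAG(
    'wfm_forecast_training',
    default_args=default_args,
    # Weekly, Sunday 2 AM. Airflow 2.4+ prefers the name `schedule` for this argument.
    schedule_interval='0 2 * * 0',
    start_date=datetime(2025, 1, 1),
    catchup=False,
) as dag: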

    validate_data = PythonOperator(
        task_id='validate_source_data',
        python_callable=run_data_validation,
    )

    compute_features = PythonOperator(
        task_id='compute_features',
        python_callable=run_feature_pipeline,
    )

    train_models = PythonOperator(
        task_id='train_base_models',
        python_callable=run_training,
    )

    validate_models = PythonOperator(
        task_id='validate_models',
        python_callable=run_model_validation,
    )

    register = PythonOperator(
        task_id='register_model',
        python_callable=register_if_improved,
    )

    validate_data >> compute_features >> train_models >> validate_models >> register

Model Registry

The model registry tracks every trained model with its metadata, making rollback and comparison possible.

What to store per model version:

  • Model artifact (serialized model file)
  • Training data hash (reproducibility)
  • Hyperparameters
  • Validation metrics (MAPE, MAE, RMSE, bias)
  • Feature importance scores
  • Training timestamp and duration
  • Promotion status (staging / production / archived)
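
One simple way to produce the training data hash is to fingerprint the exact file the model was trained on. The sketch below assumes the training set is exported to a single file (CSV or Parquet); the helper name is illustrative.

```python
import hashlib
from pathlib import Path

def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The digest can then be stored next to the model version, for example as a run tag, so that two versions trained on byte-identical data are recognizable as such.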

MLflow implementation:

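import mlflow
import mlflow.sklearn

# params, val_mape, val_mae, val_bias, and model come from the training step.
with mlflow.start_run(run_name="lgbm_daily_volume_v23"):
    mlflow.log_params(params)
    mlflow.log_metric("val_mape", val_mape)
    mlflow.log_metric("val_mae", val_mae)
    mlflow.log_metric("val_bias", val_bias)
    # registered_model_name adds this run's model as a new version of
    # "daily_volume" in the registry; without it the model is only logged to the run.
    mlflow.sklearn.log_model(model, "model", registered_model_name="daily_volume")
    # The CSV must already exist on disk before it can be logged.
    mlflow.log_artifact("feature_importance.csv")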

Serving: Batch vs Real-Time

| Pattern | Use Case | Latency | Architecture |
|---|---|---|---|
| Batch prediction | Daily/weekly volume forecast, long-range capacity plan | Minutes acceptable | Scheduled job writes predictions to database table; WFM tool reads table |
| Micro-batch | Intraday reforecast every 30-60 minutes | Seconds | Triggered job recomputes forecast with latest actuals |
| Real-time | Per-contact routing decisions, real-time staffing alerts | Milliseconds | Model served via REST API (FastAPI, SageMaker endpoint) |

For most WFM teams, batch prediction is sufficient: daily forecasts are generated at 2 AM and loaded into the WFM tool by 6 AM. Teams that also reforecast intraday can add a micro-batch job triggered every 30 minutes during operating hours.

Batch serving pipeline:

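from datetime import datetime

import mlflow.sklearn
from mlflow.tracking import MlflowClient

# compute_forecast_features, validate_predictions, and write_forecast_to_db
# are project-specific helpers (not shown here).

def generate_daily_forecast():
    """Run as scheduled job: daily at 02:00."""
    # Resolve the production version first so it can be stored with the forecast.
    # The loaded sklearn model itself carries no registry version attribute.
    # Stage-based lookup follows the original example; newer MLflow releases
    # favor model aliases over stages.
    client = MlflowClient()
    production = client.get_latest_versions("daily_volume", stages=["Production"])[0]
    model = mlflow.sklearn.load_model(f"models:/daily_volume/{production.version}")

    # Compute features for forecast horizon
    features = compute_forecast_features(
        horizon_days=28,
        as_of_date=datetime.today()
    )

    # Generate predictions
    predictions = model.predict(features)

    # Validate predictions (sanity checks) before anything is written,
    # so a bad forecast never reaches the table the WFM tool reads
    validate_predictions(predictions)

    # Write to forecast table
    write_forecast_to_db(predictions, model_version=production.version)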

Monitoring

Production ML systems degrade silently. Without monitoring, a model can produce increasingly bad forecasts for weeks before anyone notices.

Three layers of monitoring:

1. Data monitoring (detect before model runs):

  • Source data freshness — did today's data arrive?
  • Schema validation — did column types or names change?
  • Distribution drift — has the input distribution shifted beyond thresholds?
  • Missing value rates — sudden increase in nulls?

2. Model monitoring (detect after model runs):

  • Prediction distribution — are forecasts within historical bounds?
  • Feature importance stability — did feature rankings shift dramatically?
  • Inference latency — is the model slower than expected?

3. Performance monitoring (detect after actuals arrive):

  • Forecast accuracy (MAPE, MAE) by segment, horizon, day-of-week
  • Bias detection — is the model systematically over/under-forecasting?
  • Accuracy degradation trend — rolling 4-week accuracy declining?
  • Comparison to naive baseline — does the model still beat seasonal naive?
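
The performance metrics above are simple to compute once actuals arrive. The following is a small standard-library sketch; the function names, the sign convention for bias (positive means over-forecasting), and the sample numbers are assumptions for illustration.

```python
def mape(actuals, forecasts):
    """Mean absolute percentage error, skipping periods with zero actuals."""
    pairs = [(a, f) for a, f in zip(actuals, forecasts) if a != 0]
    if not pairs:
        raise ValueError("MAPE is undefined when every actual is zero")
    return 100 * sum(abs(a - f) / abs(a) for a, f in pairs) / len(pairs)

def bias_pct(actuals, forecasts):
    """Signed bias as a percent of total actual volume (positive = over-forecast)."""
    return 100 * (sum(forecasts) - sum(actuals)) / sum(actuals)

def seasonal_naive(history, horizon, season=7):
    """Forecast each day as the value observed one season (7 days) earlier."""
    last_season = history[-season:]
    return [last_season[i % season] for i in range(horizon)]

# Two weeks of daily history, then one week of actuals and model forecasts.
history = [100, 120, 130, 125, 110, 60, 50, 104, 118, 133, 128, 112, 62, 48]
actuals = [110, 120, 130, 130, 110, 60, 50]
model_forecast = [108, 121, 131, 127, 111, 61, 49]

naive_forecast = seasonal_naive(history, horizon=7)
beats_baseline = mape(actuals, model_forecast) < mape(actuals, naive_forecast)
```

Tracking `beats_baseline` over time matters because a model whose MAPE looks acceptable in absolute terms can still be doing no better than repeating last week.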

Alert thresholds (contact center volume forecasting):

| Metric | Warning | Critical |
|---|---|---|
| Daily MAPE | > 10% for 3 consecutive days | > 15% for any day |
| Weekly bias | > ±3% sustained bias over 2 weeks | > ±5% any week |
| Data freshness | > 2 hours late | > 6 hours late |
| Feature drift | KL divergence > 0.1 on any feature | KL divergence > 0.3 |
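
As a rough sketch of the feature drift check, KL divergence can be estimated by histogramming a reference window and a current window of one feature on shared bins. The bin count, the smoothing constant, and the direction (current relative to reference) are assumptions here, and the resulting value is sensitive to all three, so thresholds should be calibrated against the same settings.

```python
import math

def kl_divergence(reference, current, bins=10, eps=1e-6):
    """Estimate KL(current || reference) from two samples using shared bins."""
    lo = min(min(reference), min(current))
    hi = max(max(reference), max(current))
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def histogram(values):
        counts = [0] * bins
        for v in values:
            counts[min(int((v - lo) / width), bins - 1)] += 1
        # Additive smoothing keeps empty bins from producing log(0).
        total = len(values) + eps * bins
        return [(c + eps) / total for c in counts]

    p, q = histogram(current), histogram(reference)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def drift_status(kl, warning=0.1, critical=0.3):
    """Map a KL value to the alert levels in the threshold table."""
    if kl > critical:
        return "critical"
    return "warning" if kl > warning else "ok"

reference = list(range(100))
shifted = [v + 50 for v in reference]
status = drift_status(kl_divergence(reference, shifted))
```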

Dashboard stack: Grafana (visualization) + Prometheus (metrics collection) + custom Python scripts writing metrics. Alternatively, Weights & Biases for ML-specific monitoring.

Architecture Patterns by Maturity

Level 1: Scripted (Jupyter → Cron)

[Jupyter Notebook] → [Python script] → [cron job]
                                           ↓
                                    [CSV / database table]
                                           ↓
                                    [WFM tool import]

Components:

  • Python script extracted from notebook
  • cron scheduling (Linux) or Task Scheduler (Windows)
  • Results written to CSV or database table
  • Manual import into WFM tool
  • Error notification via email or Slack webhook

Pros: Fast to implement, no infrastructure dependencies.

Cons: No model versioning, no monitoring, no rollback, single point of failure.

Level 2: Orchestrated (Airflow + MLflow)

[Airflow Scheduler]
    ↓
[Data Validation] → [Feature Pipeline] → [Training] → [Validation]
                                                            ↓
                                                    [MLflow Registry]
                                                            ↓
                                                    [Batch Prediction Job]
                                                            ↓
                                                    [Forecast Database]
                                                            ↓
                                                    [WFM Tool API / Import]

[Grafana Dashboard] ← [Accuracy Metrics Pipeline]

Components:

  • Apache Airflow or Prefect for orchestration
  • MLflow for experiment tracking and model registry
  • PostgreSQL for feature store (dbt models)
  • Grafana + Prometheus for monitoring
  • Slack/PagerDuty for alerts

Pros: Reproducible, versioned, monitored, supports multiple models.

Cons: Significant setup time (2-4 weeks for initial build), requires infrastructure knowledge.

Level 3: Full MLOps

Adds to Level 2:

  • Feast or Tecton for online/offline feature store
  • A/B testing infrastructure (shadow mode, canary deployment)
  • Automated retraining triggered by drift detection
  • CI/CD for model code (GitHub Actions → train → validate → deploy)
  • Infrastructure as code (Terraform/Pulumi for cloud resources)

Rarely justified for WFM. This level makes sense when: (a) models serve real-time routing decisions, (b) model count exceeds 50, or (c) regulatory requirements demand full audit trails.

Build vs Buy Decision

| Factor | Build Custom ML Pipeline | Use Vendor Built-in ML |
|---|---|---|
| Data volume | > 2 years history, multiple data sources | < 2 years, single WFM tool data |
| Team skills | Data engineer + ML engineer on staff | WFM analysts only |
| Accuracy requirement | Every 1% MAPE matters (large centers) | Good enough is sufficient |
| Update frequency | Daily or intraday reforecast | Weekly reforecast acceptable |
| Vendor capability | Vendor ML is a black box or underperforms | Vendor ML is competitive |
| Budget | Engineering time available | Minimal engineering budget |

Hybrid approach (recommended for most): Use the WFM vendor's built-in forecasting as one model in an ensemble. Build a lightweight ML pipeline (Level 1-2) for the additional models. Combine outputs via optimized weights.
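
For the two-model case (vendor forecast plus one custom model), "optimized weights" can be as simple as a grid search over a single blend weight on a validation window. This is a sketch under that assumption; the function name and the choice of MAE as the objective are illustrative, and ensembles with more models would need a proper constrained optimizer.

```python
def best_blend_weight(actuals, vendor, custom, steps=100):
    """Grid-search the vendor weight w in [0, 1] that minimizes the MAE of
    w * vendor + (1 - w) * custom on a validation window."""
    def mae(w):
        errors = (abs(a - (w * v + (1 - w) * c))
                  for a, v, c in zip(actuals, vendor, custom))
        return sum(errors) / len(actuals)

    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=mae)

# Validation window where the vendor over-forecasts by 10 and the custom
# model under-forecasts by 10, so an even blend cancels the errors.
actuals = [100, 200, 300]
vendor = [110, 210, 310]
custom = [90, 190, 290]
weight = best_blend_weight(actuals, vendor, custom)
```

The weight should be fit on data that neither model was trained on; otherwise the blend favors whichever model overfits the most.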

Data Validation Patterns

Data quality issues are the most common cause of ML model failure in WFM — more common than bad models or wrong features.

Pre-Training Validation

Run before every training pipeline execution:

| Check | Implementation | Action on Failure |
|---|---|---|
| Completeness | Count rows per day; flag if any day has < 80% of expected intervals | Halt training; investigate data source |
| Recency | Verify most recent data is within 24 hours | Halt training; check ETL pipeline |
| Range | Volume per interval within [0, 3x historical max] | Flag outliers; decide whether to clip or exclude |
| AHT bounds | AHT within [30s, 3600s] (configurable by queue) | Clip or exclude extreme values |
| Holiday flags | Verify holiday calendar includes next 12 months | Warning; update holiday table |
| Schema | Column names and types match expected schema | Halt training; investigate upstream change |

Example implementation:

import pandas as pd

def validate_training_data(df, config):
    """Run pre-training data validation checks; return a list of issue strings.

    Assumes df['date'] is a datetime64 column at daily grain.
    """
    issues = []

    # Completeness: check for missing dates. Compare as DatetimeIndex objects;
    # mixing pandas Timestamps and numpy datetime64 values in Python sets
    # does not match reliably.
    expected_dates = pd.date_range(config['start_date'], config['end_date'])
    actual_dates = pd.DatetimeIndex(df['date'].unique())
    missing = expected_dates.difference(actual_dates)
    if len(missing) > 0:
        sample = [d.date().isoformat() for d in missing[:5]]
        issues.append(f"CRITICAL: {len(missing)} missing dates, first few: {sample}")

    # Recency
    latest = df['date'].max()
    staleness = (pd.Timestamp.now() - latest).days
    if staleness > 2:
        issues.append(f"WARNING: Data is {staleness} days stale")

    # Volume range: [0, 3x historical max]
    negative = df[df['volume'] < 0]
    if len(negative) > 0:
        issues.append(f"WARNING: {len(negative)} intervals with negative volume")
    extreme_high = df[df['volume'] > config['max_volume'] * 3]
    if len(extreme_high) > 0:
        issues.append(f"WARNING: {len(extreme_high)} intervals exceed 3x max volume")

    # AHT range
    aht_outliers = df[(df['aht'] < 30) | (df['aht'] > 3600)]
    if len(aht_outliers) > 0:
        issues.append(f"WARNING: {len(aht_outliers)} intervals with AHT outside [30s, 3600s]")

    return issues

Post-Prediction Validation

Run after every prediction to catch model failures before forecasts reach the WFM tool:

  • Predictions must be non-negative
  • Daily total within ±40% of trailing 4-week same-day-of-week average
  • No single interval exceeds 5x the daily average interval (catches spike artifacts)
  • Intraday shape correlation with historical same-day-of-week > 0.8 (catches shape distortion)
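
The four checks above could be sketched as a single function for one forecast day. The function name, argument layout, and message wording are assumptions; `trailing_dow_totals` stands for the last four same-day-of-week daily totals, and `historical_shape` for the typical interval profile of that weekday.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def check_daily_forecast(intervals, trailing_dow_totals, historical_shape):
    """Return a list of sanity-check failures for one day's interval forecast."""
    issues = []
    if any(v < 0 for v in intervals):
        issues.append("negative prediction")
    daily_total = sum(intervals)
    baseline = sum(trailing_dow_totals) / len(trailing_dow_totals)
    if abs(daily_total - baseline) > 0.40 * baseline:
        issues.append("daily total outside 40% band around trailing average")
    if max(intervals) > 5 * (daily_total / len(intervals)):
        issues.append("spike: interval exceeds 5x the daily average interval")
    if pearson(intervals, historical_shape) <= 0.8:
        issues.append("shape correlation with history at or below 0.8")
    return issues

shape = [10, 20, 40, 60, 50, 30, 20, 10]
recent_totals = [230, 250, 240, 245]
good = check_daily_forecast([11, 19, 42, 58, 52, 29, 21, 9], recent_totals, shape)
bad = check_daily_forecast([0, 0, 0, 0, 0, 0, 0, 240], recent_totals, shape)
```

In the second call the daily total is plausible, yet the spike and shape checks both fire. That is the case the interval-level checks exist to catch.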

Common Failure Modes

  1. Training/serving skew: Features computed differently in training vs production. Fix: centralize feature logic (feature store or shared SQL).
  2. Stale models: Model trained once, never retrained as patterns change. Fix: automated retraining schedule with performance gates.
  3. No rollback: Bad model deployed with no way to revert. Fix: model registry with promotion stages.
  4. Silent degradation: Model accuracy drops but no one knows. Fix: monitoring dashboards with automated alerts.
  5. Overengineering: Building Level 3 infrastructure for a 50-seat center. Fix: match architecture to actual need.
  6. Feature leakage: Using information that would not be available at prediction time. Most common: including the target variable (or a derivative) as a feature, or computing rolling statistics that include future data. Fix: strict point-in-time feature computation.
  7. Calendar feature gaps: Forgetting to update the holiday table for the new year, causing the model to miss holiday effects entirely. Fix: holiday table validation in the pre-training checks.
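
For failure mode 7, one cheap heuristic is to confirm that every calendar year touched by the next 12 months has at least one entry in the holiday table. This is a sketch of that idea only; the function name is illustrative, and it would not catch a year that is present but incomplete.

```python
from datetime import date, timedelta

def missing_holiday_years(holiday_dates, as_of, horizon_days=365):
    """Return calendar years within the horizon that have no holiday entries."""
    horizon_end = as_of + timedelta(days=horizon_days)
    needed_years = set(range(as_of.year, horizon_end.year + 1))
    covered_years = {d.year for d in holiday_dates}
    return sorted(needed_years - covered_years)

# A table that was never extended past the current year.
holidays = [date(2025, 1, 1), date(2025, 12, 25)]
gaps = missing_holiday_years(holidays, as_of=date(2025, 11, 15))
```

A non-empty result here would map to the "Warning; update holiday table" action in the pre-training checks.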

Getting Started: Week-by-Week Plan

For WFM teams with no ML infrastructure, here is a realistic plan to reach Level 1 (scripted pipeline):

Week 1: Environment setup

  • Install Python (Anaconda distribution recommended for data science packages)
  • Set up version control (Git + GitHub/GitLab)
  • Create project structure:
wfm-forecast/
├── data/           # Raw and processed data
├── features/       # Feature computation scripts
├── models/         # Model training scripts
├── evaluation/     # Accuracy evaluation scripts
├── output/         # Forecast output files
├── config.yaml     # Configuration (data paths, parameters)
└── README.md       # Documentation

Week 2: Data pipeline

  • Write SQL or Python script to extract historical data from ACD/WFM tool
  • Build feature computation script (calendar features, lag features, rolling statistics)
  • Validate: does the feature set for date T only use data available before date T?

Week 3: Model training

  • Implement train/validation/test split (chronological)
  • Train 2-3 models (start with LightGBM + linear regression + seasonal naive)
  • Evaluate on validation set (MAPE, MAE, bias)
  • Select best model or build simple ensemble (equal weights)
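
A chronological split and an equal-weight ensemble can each be written in a few lines. The sketch below uses plain Python lists of dict rows and assumes 28-day validation and test windows; both the window sizes and the function names are illustrative choices.

```python
from datetime import date, timedelta

def chronological_split(rows, val_days=28, test_days=28):
    """Split daily rows into train/validation/test by date, oldest first."""
    ordered = sorted(rows, key=lambda r: r["date"])
    n = len(ordered)
    train_end = n - val_days - test_days
    if train_end <= 0:
        raise ValueError("not enough history for the requested windows")
    return ordered[:train_end], ordered[train_end:n - test_days], ordered[n - test_days:]

def equal_weight_ensemble(forecasts):
    """Average several models' forecasts position by position."""
    return [sum(values) / len(values) for values in zip(*forecasts)]

rows = [{"date": date(2025, 1, 1) + timedelta(days=i), "volume": 100 + i}
        for i in range(100)]
train, val, test = chronological_split(rows)
blended = equal_weight_ensemble([[100, 200], [110, 190], [90, 210]])
```

Unlike a random split, every training row here predates every validation row, which mirrors how the model will be used in production.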

Week 4: Production script

  • Extract model training into a standalone Python script (not a notebook)
  • Add logging (what ran, when, what accuracy)
  • Add basic error handling (data not found, model fails validation)
  • Write forecast output to CSV or database table
  • Set up cron job or Windows Task Scheduler to run daily
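
The logging and error-handling items can be covered by a thin runner around the pipeline steps. This is a minimal sketch; the logger name, step names, and the convention of returning a process exit code are assumptions.

```python
import logging
import time

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("wfm_forecast")

def run_pipeline(steps):
    """Run (name, callable) steps in order; return 0 on success, 1 on first failure."""
    for name, step in steps:
        started = time.monotonic()
        try:
            step()
        except Exception:
            # log.exception records the full traceback for later troubleshooting.
            log.exception("step %s failed", name)
            return 1
        log.info("step %s finished in %.1fs", name, time.monotonic() - started)
    return 0
```

A script entry point would pass the result to `sys.exit`, for example `sys.exit(run_pipeline([("extract", extract), ("train", train)]))`, so that cron or Task Scheduler sees a non-zero exit status when a step fails.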

Week 5-6: Monitoring and hardening

  • Build accuracy tracking: script that compares yesterday's forecast to actuals, logs metrics
  • Set up email or Slack alert for accuracy threshold violations
  • Document: what the pipeline does, how to run it manually, how to troubleshoot common errors
  • Train a backup person to operate the pipeline

Total investment: 40-60 hours of analyst time over 6 weeks. No infrastructure cost if using local machine + open-source tools.

See Also