Predictive Analytics with Python & XGBoost
A full ML pipeline to predict hourly electricity demand across the PJM East region — helping operators anticipate overloads, optimize generation, and balance supply with demand.
Starting from raw hourly energy data, this pipeline builds progressively more accurate forecasts: first learning daily and seasonal cycles, then layering in year-over-year lag signals, and finally generating a full 12-month forward forecast. Every step is validated with time-aware cross-validation so no future information leaks into training.
Six stages, each building on the last. Click any step to see the logic and code.
The raw PJME data revealed a clear two-band pattern: consumption oscillating between a 20–40 GW lower band and a 40–60 GW upper band, repeating reliably year after year. A one-week zoom confirmed daily peaks (35–44 GW) and overnight valleys (27–33 GW), with weekday vs weekend differences visible.
Standard random cross-validation leaks future data into training. TimeSeriesSplit creates 5 walk-forward folds: each trains on all past data and tests on one unseen future year. A 24-hour gap prevents the model from learning from data immediately adjacent to the test boundary.
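A minimal sketch of the split described above, using scikit-learn's `TimeSeriesSplit` with a placeholder array standing in for the hourly PJME series (the fold sizes here are assumptions, not the project's exact values):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# ~6 years of hourly rows; values are a stand-in for the real features.
X = np.arange(24 * 365 * 6).reshape(-1, 1)

# 5 walk-forward folds, each testing on one year of hours, with a
# 24-hour gap so training never touches rows adjacent to the test set.
tscv = TimeSeriesSplit(n_splits=5, test_size=24 * 365, gap=24)

for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    # Every training index precedes every test index: no future leakage.
    assert train_idx.max() < test_idx.min()
    print(f"fold {fold}: train ends {train_idx.max()}, test starts {test_idx.min()}")
```

Each successive fold absorbs the previous test year into its training window, mimicking how the model would be retrained in production.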
XGBoost can't interpret datetime strings — it needs numbers. This step decomposes each timestamp into 8 numerical signals encoding time-of-day, day-of-week, and seasonal context. Boxplots confirmed the patterns: consumption peaks around 6 PM daily and spikes in winter and summer months.
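The decomposition can be sketched with pandas; the eight feature names below are illustrative, assuming the frame is indexed by hourly timestamps as in the PJME dataset:

```python
import pandas as pd

def create_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """Decompose a DatetimeIndex into numeric signals a tree model can use."""
    df = df.copy()
    df["hour"] = df.index.hour
    df["dayofweek"] = df.index.dayofweek
    df["quarter"] = df.index.quarter
    df["month"] = df.index.month
    df["year"] = df.index.year
    df["dayofyear"] = df.index.dayofyear
    df["dayofmonth"] = df.index.day
    df["weekofyear"] = df.index.isocalendar().week.astype(int)
    return df

# Example: two days of hourly timestamps become an 8-column numeric frame.
idx = pd.date_range("2017-01-01", periods=48, freq="h")
feats = create_time_features(pd.DataFrame(index=idx))
print(feats.columns.tolist())
```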
XGBoost handles engineered temporal features efficiently and surfaces feature importances natively. Depth-3 trees keep individual learners simple; a cap of 1,000 estimators with 50-round early stopping automatically finds the optimal ensemble size and prevents overfitting.
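A configuration sketch of the regressor described above: the depth, estimator cap, and early-stopping window mirror the text, while the learning rate and objective are assumptions.

```python
import xgboost as xgb

model = xgb.XGBRegressor(
    n_estimators=1000,        # upper bound; early stopping picks the real size
    max_depth=3,              # shallow trees keep individual learners simple
    learning_rate=0.01,       # assumed value, not stated in the project
    early_stopping_rounds=50,
    objective="reg:squarederror",
)

# Fit with a held-out validation set so early stopping can monitor error:
# model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
```

Passing `early_stopping_rounds` in the constructor is the modern XGBoost idiom; older code passed it to `fit` instead.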
The baseline model treats each hour independently. Lag features add historical memory: "what was consumption exactly 1, 2, and 3 years ago at this same hour?" This lets the model detect year-over-year trends and anomalous deviations — without any external weather data.
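One way to sketch the lag lookup, assuming an hourly DatetimeIndex and a target column named `PJME_MW` (the column name is an assumption from the dataset's convention):

```python
import pandas as pd

def add_lag_features(df: pd.DataFrame, target: str = "PJME_MW") -> pd.DataFrame:
    """Look up the target value 1, 2, and 3 years before each timestamp.

    Timestamps with no historical counterpart get NaN, which XGBoost
    handles natively.
    """
    df = df.copy()
    target_map = df[target].to_dict()
    df["lag1"] = (df.index - pd.offsets.DateOffset(years=1)).map(target_map)
    df["lag2"] = (df.index - pd.offsets.DateOffset(years=2)).map(target_map)
    df["lag3"] = (df.index - pd.offsets.DateOffset(years=3)).map(target_map)
    return df
```

Mapping through a dictionary keyed by timestamp, rather than shifting by a fixed row count, keeps the lookup correct across leap years and any gaps in the index.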
With the final model trained on all available data, an empty hourly DatetimeIndex is created for August 2018 through August 2019. Feature engineering and lag lookups populate each row, then the model predicts every hour — producing 8,616 forward forecasts with realistic seasonal cycles.
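The forward frame can be sketched as follows; the exact start and end dates are assumptions standing in for the project's August 2018 to August 2019 window, and `add_lag_features` / `create_time_features` / `model` refer to the helpers from the earlier steps:

```python
import pandas as pd

# An empty hourly frame spanning the forward horizon.
future_idx = pd.date_range("2018-08-03", "2019-08-01", freq="h")
future = pd.DataFrame(index=future_idx)
future["isFuture"] = True

# The same feature-engineering and lag-lookup helpers used in training
# would then populate each row before predicting:
# future = add_lag_features(create_time_features(future))
# future["pred"] = model.predict(future[FEATURES])
```

Because the lag features reach back one to three years, every future hour has real historical anchors even though the target column itself is empty.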
RMSE (Root Mean Squared Error) measures average prediction error in megawatts — the same unit as the target, making it directly interpretable by grid operators.
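The metric itself is a one-liner; a small sketch with made-up demand values to show the units carrying through:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units (MW) as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Errors of -100, 0, and +200 MW give an RMSE of ~129 MW.
print(rmse([30000, 31000, 32000], [30100, 31000, 31800]))
```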
XGBoost surfaces which time features it relied on most. Hour of day dominates: the model learned that the time of day matters more than any other signal.
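A sketch of ranking the importances; in the real pipeline the numbers come from `model.feature_importances_` after fitting, so the values below are illustrative placeholders (only the ~92% hour share is reported by the project):

```python
import pandas as pd

# Placeholder importances; replace with
# pd.Series(model.feature_importances_, index=FEATURES) after fitting.
fi = pd.Series(
    {"hour": 0.92, "month": 0.03, "dayofweek": 0.02,
     "dayofyear": 0.01, "year": 0.01, "quarter": 0.005,
     "dayofmonth": 0.003, "weekofyear": 0.002},
    name="importance",
).sort_values(ascending=False)

print(fi.head(3))
# A horizontal bar chart makes the ranking easy to read:
# fi.plot(kind="barh", title="Feature importance")
```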
What the data and model revealed about power consumption patterns.
Consumption oscillates reliably between two bands year after year — a 20–40 GW moderate band and a 40–60 GW peak band driven by summer cooling and winter heating.
The model assigned ~92% relative importance to the hour feature. Daily cycles — morning rise, 6 PM peak, overnight trough — explain the majority of variance in consumption.
Adding year-ago values (lag1, lag2, lag3) improved cross-validated RMSE by giving the model historical anchors — especially valuable for capturing seasonal peak magnitudes.
Per-day error analysis showed worst predictions cluster around holidays — days where human behavior breaks the learned weekly and daily patterns.
End-to-end Python ML pipeline from data ingestion to future forecast generation.