Machine Learning · Time Series · Energy Forecasting

Power Consumption Forecasting

Predictive Analytics with Python & XGBoost

XGBoost Time Series CV Feature Engineering Lag Features RMSE Evaluation Future Forecasting
~145K · Hourly Observations
2002–2019 · Data Span
3,722 MW · Final RMSE
~11% · Error Rate

What This Project Does

A full ML pipeline to predict hourly electricity demand across the PJM East region — helping operators anticipate overloads, optimize generation, and balance supply with demand.

Starting from raw hourly energy data, this pipeline builds progressively more accurate forecasts: first learning daily and seasonal cycles, then layering in year-over-year lag signals, and finally generating a full 12-month forward forecast. Every step is validated with time-aware cross-validation so no future information leaks into training.

Python Scikit-learn XGBoost Pandas Matplotlib Seaborn NumPy KaggleHub

The ML Pipeline

Six stages, each building on the last; each step below walks through the logic and code.

Step 01
Data Loading & EDA
Explore the raw signal — seasonal bands and daily cycles.
Step 02
Time Series Cross-Validation
5-fold walk-forward split — test on each successive year.
Step 03
Feature Engineering
Convert datetime into 8 numerical signals the model can learn.
Step 04
XGBoost Training
Gradient-boosted trees with early stopping — depth 3, lr 0.01.
Step 05
Lag Features
Year-ago values (lag1/2/3) to capture long-term trends.
Step 06
Future Forecasting
Retrain on all data — generate 8,616 hourly predictions ahead.

Step 01 — Data Loading & Exploratory Analysis

The raw PJME data revealed a clear two-band pattern: consumption oscillating between a 20–40 GW lower band and a 40–60 GW upper band, repeating reliably year after year. A one-week zoom confirmed daily peaks (35–44 GW) and overnight valleys (27–33 GW), with weekday vs weekend differences visible.

```python
import kagglehub
import pandas as pd

# Load PJME hourly energy data from Kaggle (pjme_file: path to the downloaded CSV)
df = pd.read_csv(pjme_file)
df = df.set_index('Datetime')
df.index = pd.to_datetime(df.index)
# → 145,366 hourly observations, 2002–2018

df.plot(style='.', figsize=(15, 5), title='PJME Energy Use in MW')
```

Step 02 — Time Series Cross-Validation

Standard random cross-validation leaks future data into training. TimeSeriesSplit creates 5 walk-forward folds: each trains on all past data and tests on one unseen future year. A 24-hour gap keeps the model from learning from data immediately adjacent to the test boundary.

```python
from sklearn.model_selection import TimeSeriesSplit

tss = TimeSeriesSplit(
    n_splits=5,
    test_size=24 * 365 * 1,  # 1 year of hours per test fold
    gap=24                   # 24-hr buffer prevents leakage
)
# Fold 1: train 2002–2013 → test 2014
# Fold 2: train 2002–2014 → test 2015 ...
```
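As a quick sanity check, the fold boundaries can be inspected by iterating over the splits. This sketch uses a small synthetic hourly index rather than the real PJME data (the `load` column and date range are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Synthetic hourly series standing in for the real PJME data
idx = pd.date_range('2010-01-01', periods=24 * 365 * 6, freq='h')
X = pd.DataFrame({'load': np.random.rand(len(idx))}, index=idx)

tss = TimeSeriesSplit(n_splits=5, test_size=24 * 30, gap=24)
for fold, (train_i, test_i) in enumerate(tss.split(X)):
    # Each fold's training set ends before the test window, with a 24-hour gap
    print(f"Fold {fold}: train ends {X.index[train_i[-1]].date()}, "
          f"test {X.index[test_i[0]].date()} → {X.index[test_i[-1]].date()}")
```

Every fold's training window ends strictly before its test window, which is exactly the leakage guarantee random K-fold cannot give.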

Step 03 — Feature Engineering

XGBoost can't interpret datetime strings — it needs numbers. This step decomposes each timestamp into 8 numerical signals encoding time-of-day, day-of-week, and seasonal context. Boxplots confirmed the patterns: consumption peaks around 6 PM daily and spikes in winter and summer months.

```python
def create_features(df):
    df = df.copy()
    df['hour'] = df.index.hour            # 0–23
    df['dayofweek'] = df.index.dayofweek  # 0 = Monday
    df['quarter'] = df.index.quarter
    df['month'] = df.index.month
    df['year'] = df.index.year
    df['dayofyear'] = df.index.dayofyear
    df['dayofmonth'] = df.index.day
    df['weekofyear'] = df.index.isocalendar().week
    return df
```

Step 04 — XGBoost Training

XGBoost handles engineered temporal features efficiently and surfaces feature importances natively. Depth-3 trees keep individual learners simple; 1,000 estimators with 50-round early stopping automatically finds the optimal ensemble size and prevents overfitting.

```python
import xgboost as xgb

reg = xgb.XGBRegressor(
    base_score=0.5,
    booster='gbtree',
    n_estimators=1000,
    early_stopping_rounds=50,     # stops when no improvement
    objective='reg:squarederror',
    max_depth=3,                  # shallow = generalizable
    learning_rate=0.01
)
reg.fit(X_train, y_train,
        eval_set=[(X_train, y_train), (X_test, y_test)],
        verbose=100)
```

Step 05 — Lag Features

The baseline model treats each hour independently. Lag features add historical memory: "what was consumption exactly 1, 2, and 3 years ago at this same hour?" This lets the model detect year-over-year trends and anomalous deviations — without any external weather data.
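The lags below use multiples of 364 days rather than 365; since 364 is an exact multiple of 7, the lagged value lands on the same weekday, which presumably preserves the weekly pattern. A quick check (dates chosen for illustration):

```python
import pandas as pd

ts = pd.Timestamp('2017-07-14 18:00')                # a Friday at 6 PM
print(ts.day_name())                                 # Friday
print((ts - pd.Timedelta('364 days')).day_name())    # Friday — weekday preserved
print((ts - pd.Timedelta('365 days')).day_name())    # Thursday — weekday shifts
```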

```python
def add_lags(df):
    target_map = df['PJME_MW'].to_dict()
    # Same hour from 1, 2, and 3 years prior
    df['lag1'] = (df.index - pd.Timedelta('364 days')).map(target_map)
    df['lag2'] = (df.index - pd.Timedelta('728 days')).map(target_map)
    df['lag3'] = (df.index - pd.Timedelta('1092 days')).map(target_map)
    return df

# Updated: FEATURES += ['lag1', 'lag2', 'lag3']
# Cross-validated RMSE improved significantly
```

Step 06 — Future Forecasting

With the final model retrained on all available data, a placeholder DataFrame with an hourly DatetimeIndex is created spanning August 2018 through August 2019. Feature engineering and lag lookups populate each row, then the model predicts every hour, producing 8,616 forward forecasts with realistic seasonal cycles.

```python
# Build hourly future timestamp index
future = pd.date_range('2018-08-03', '2019-08-01', freq='1h')
future_df = pd.DataFrame(index=future)
future_df['isFuture'] = True

# Merge with history → engineer features + lags → predict
df_and_future = pd.concat([df, future_df])
df_and_future = create_features(df_and_future)
df_and_future = add_lags(df_and_future)
future_w_features = df_and_future.query('isFuture').copy()
future_w_features['pred'] = reg.predict(future_w_features[FEATURES])
# 8,616 hourly predictions generated ✓
```

Results & Evaluation

RMSE (Root Mean Squared Error) measures average prediction error in megawatts — the same unit as the target, making it directly interpretable by grid operators.
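For reference, RMSE is simply the square root of the mean squared residual. A minimal computation with hypothetical actual/predicted values (not the project's data):

```python
import numpy as np

# Hypothetical actual vs. predicted demand in MW
y_true = np.array([32000.0, 41000.0, 28000.0, 35000.0])
y_pred = np.array([30500.0, 43000.0, 27000.0, 36000.0])

# Root of the mean squared residual — same unit (MW) as the target
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(f"RMSE: {rmse:.0f} MW")  # → RMSE: 1436 MW
```

Because the errors are squared before averaging, RMSE penalizes large misses more heavily than mean absolute error would, which matters for grid operators planning reserve capacity.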

Final RMSE
3,722 MW
On held-out test data — approximately 11% of average grid consumption
Average Consumption
32.8 GW
Mean power demand across the full PJME test period
Validation Folds
5 folds
Walk-forward CV — each fold tests on a different unseen year
Forecast Horizon
1 year
8,616 hourly predictions generated beyond the training data

Feature Importance

XGBoost surfaces which time features it relied on most. Hour of day dominates — the model learned that when it is matters more than anything else.
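The percentages below read as relative to the strongest feature. A sketch of that scaling, using hypothetical raw gain scores (in the project these would come from the fitted regressor's `feature_importances_`):

```python
import pandas as pd

# Hypothetical raw importance scores (illustrative, not the model's actual values)
raw = pd.Series({'hour': 0.46, 'dayofyear': 0.37, 'month': 0.34,
                 'year': 0.26, 'dayofweek': 0.19, 'quarter': 0.12})

# Rescale so the strongest feature reads 100%, as in the chart below
relative = (raw / raw.max() * 100).round().astype(int)
print(relative.sort_values(ascending=False))
```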

hour: 92%
dayofyear: 74%
month: 68%
year: 52%
dayofweek: 38%
quarter: 24%

Key Insights

What the data and model revealed about power consumption patterns.

01

Dual-band seasonal pattern

Consumption oscillates reliably between two bands year after year — a 20–40 GW moderate band and a 40–60 GW peak band driven by summer cooling and winter heating.

02

Hour of day dominates

The model assigned ~92% relative importance to the hour feature. Daily cycles — morning rise, 6 PM peak, overnight trough — explain the majority of variance in consumption.

03

Lag features improve accuracy

Adding year-ago values (lag1, lag2, lag3) improved cross-validated RMSE by giving the model historical anchors — especially valuable for capturing seasonal peak magnitudes.

04

Holidays cause worst errors

Per-day error analysis showed worst predictions cluster around holidays — days where human behavior breaks the learned weekly and daily patterns.
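That per-day analysis can be sketched as: compute absolute error per hour, aggregate by calendar date, and inspect the worst days. The frame below is a toy example with an artificial holiday spike (column names and data are illustrative):

```python
import numpy as np
import pandas as pd

# Toy hourly test frame covering July 3–4 with actuals and predictions
idx = pd.date_range('2018-07-03', periods=48, freq='h')
test = pd.DataFrame(
    {'PJME_MW': 30000 + 5000 * np.sin(np.arange(48) / 24 * 2 * np.pi)},
    index=idx)
test['prediction'] = test['PJME_MW'] * 1.02       # pretend forecast, ~2% off
test.loc['2018-07-04', 'prediction'] += 4000      # holiday breaks the learned pattern
test['error'] = np.abs(test['PJME_MW'] - test['prediction'])

# Mean absolute error per calendar date, worst first
worst_days = test['error'].groupby(test.index.date).mean().sort_values(ascending=False)
print(worst_days.head())  # July 4th floats to the top
```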

Technologies Used

End-to-end Python ML pipeline from data ingestion to future forecast generation.

🐍
Python
Core language
XGBoost
Gradient boosting
🐼
Pandas
Data manipulation
📐
Scikit-learn
CV & metrics
📊
Matplotlib
Visualization
🎨
Seaborn
Statistical plots