Predictive Analytics with Python & XGBoost
A full ML pipeline to predict hourly electricity demand across the PJM East region — helping operators anticipate overloads, optimize generation, and balance supply with demand.
Starting from raw hourly energy data, this pipeline builds progressively more accurate forecasts: first learning daily and seasonal cycles, then layering in year-over-year lag signals, and finally generating a full 12-month forward forecast. Every step is validated with time-aware cross-validation so no future information leaks into training.
Six stages, each building on the last. Click any step to see the logic and code.
The raw PJME data revealed a clear two-band pattern: consumption oscillating between a 20–40 GW lower band and a 40–60 GW upper band, repeating reliably year after year. A one-week zoom confirmed daily peaks (35–44 GW) and overnight valleys (27–33 GW), with weekday vs weekend differences visible.
Standard random cross-validation leaks future data into training. TimeSeriesSplit creates 5 walk-forward folds: each trains on all past data and tests on one unseen future year. A 24-hour gap prevents the model from learning from data immediately adjacent to the test boundary.
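A minimal sketch of the split described above, using scikit-learn's `TimeSeriesSplit` with a placeholder array standing in for the hourly PJME series (the fold sizes here are assumptions, not the project's exact values):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# ~6 years of hourly rows; values are a stand-in for the real features.
X = np.arange(24 * 365 * 6).reshape(-1, 1)

# 5 walk-forward folds, each testing on one year of hours, with a
# 24-hour gap so training never touches rows adjacent to the test set.
tscv = TimeSeriesSplit(n_splits=5, test_size=24 * 365, gap=24)

for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    # Every training index precedes every test index: no future leakage.
    assert train_idx.max() < test_idx.min()
    print(f"fold {fold}: train ends {train_idx.max()}, test starts {test_idx.min()}")
```

Each successive fold absorbs the previous test year into its training window, mimicking how the model would be retrained in production.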
XGBoost can't interpret datetime strings — it needs numbers. This step decomposes each timestamp into 8 numerical signals encoding time-of-day, day-of-week, and seasonal context. Boxplots confirmed the patterns: consumption peaks around 6 PM daily and spikes in winter and summer months.
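The decomposition can be sketched with pandas; the eight feature names below are illustrative, assuming the frame is indexed by hourly timestamps as in the PJME dataset:

```python
import pandas as pd

def create_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """Decompose a DatetimeIndex into numeric signals a tree model can use."""
    df = df.copy()
    df["hour"] = df.index.hour
    df["dayofweek"] = df.index.dayofweek
    df["quarter"] = df.index.quarter
    df["month"] = df.index.month
    df["year"] = df.index.year
    df["dayofyear"] = df.index.dayofyear
    df["dayofmonth"] = df.index.day
    df["weekofyear"] = df.index.isocalendar().week.astype(int)
    return df

# Example: two days of hourly timestamps become an 8-column numeric frame.
idx = pd.date_range("2017-01-01", periods=48, freq="h")
feats = create_time_features(pd.DataFrame(index=idx))
print(feats.columns.tolist())
```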
XGBoost handles engineered temporal features efficiently and surfaces feature importances natively. Depth-3 trees keep individual learners simple; a cap of 1,000 estimators with 50-round early stopping automatically finds the optimal ensemble size and prevents overfitting.
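A configuration sketch of the regressor described above: the depth, estimator cap, and early-stopping window mirror the text, while the learning rate and objective are assumptions.

```python
import xgboost as xgb

model = xgb.XGBRegressor(
    n_estimators=1000,        # upper bound; early stopping picks the real size
    max_depth=3,              # shallow trees keep individual learners simple
    learning_rate=0.01,       # assumed value, not stated in the project
    early_stopping_rounds=50,
    objective="reg:squarederror",
)

# Fit with a held-out validation set so early stopping can monitor error:
# model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
```

Passing `early_stopping_rounds` in the constructor is the modern XGBoost idiom; older code passed it to `fit` instead.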
The baseline model treats each hour independently. Lag features add historical memory: "what was consumption exactly 1, 2, and 3 years ago at this same hour?" This lets the model detect year-over-year trends and anomalous deviations — without any external weather data.
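One way to sketch the lag lookup, assuming an hourly DatetimeIndex and a target column named `PJME_MW` (the column name is an assumption from the dataset's convention):

```python
import pandas as pd

def add_lag_features(df: pd.DataFrame, target: str = "PJME_MW") -> pd.DataFrame:
    """Look up the target value 1, 2, and 3 years before each timestamp.

    Timestamps with no historical counterpart get NaN, which XGBoost
    handles natively.
    """
    df = df.copy()
    target_map = df[target].to_dict()
    df["lag1"] = (df.index - pd.offsets.DateOffset(years=1)).map(target_map)
    df["lag2"] = (df.index - pd.offsets.DateOffset(years=2)).map(target_map)
    df["lag3"] = (df.index - pd.offsets.DateOffset(years=3)).map(target_map)
    return df
```

Mapping through a dictionary keyed by timestamp, rather than shifting by a fixed row count, keeps the lookup correct across leap years and any gaps in the index.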
With the final model trained on all available data, an empty hourly DatetimeIndex is created for August 2018 through August 2019. Feature engineering and lag lookups populate each row, then the model predicts every hour — producing 8,616 forward forecasts with realistic seasonal cycles.
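The forward frame can be sketched as follows; the exact start and end dates are assumptions standing in for the project's August 2018 to August 2019 window, and `add_lag_features` / `create_time_features` / `model` refer to the helpers from the earlier steps:

```python
import pandas as pd

# An empty hourly frame spanning the forward horizon.
future_idx = pd.date_range("2018-08-03", "2019-08-01", freq="h")
future = pd.DataFrame(index=future_idx)
future["isFuture"] = True

# The same feature-engineering and lag-lookup helpers used in training
# would then populate each row before predicting:
# future = add_lag_features(create_time_features(future))
# future["pred"] = model.predict(future[FEATURES])
```

Because the lag features reach back one to three years, every future hour has real historical anchors even though the target column itself is empty.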
RMSE (Root Mean Squared Error) measures average prediction error in megawatts — the same unit as the target, making it directly interpretable by grid operators.
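The metric itself is a one-liner; a small sketch with made-up demand values to show the units carrying through:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units (MW) as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Errors of -100, 0, and +200 MW give an RMSE of ~129 MW.
print(rmse([30000, 31000, 32000], [30100, 31000, 31800]))
```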
XGBoost surfaces which time features it relied on most. Hour of day dominates: the model learned that the time of day matters more than any other signal.
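A sketch of ranking the importances; in the real pipeline the numbers come from `model.feature_importances_` after fitting, so the values below are illustrative placeholders (only the ~92% hour share is reported by the project):

```python
import pandas as pd

# Placeholder importances; replace with
# pd.Series(model.feature_importances_, index=FEATURES) after fitting.
fi = pd.Series(
    {"hour": 0.92, "month": 0.03, "dayofweek": 0.02,
     "dayofyear": 0.01, "year": 0.01, "quarter": 0.005,
     "dayofmonth": 0.003, "weekofyear": 0.002},
    name="importance",
).sort_values(ascending=False)

print(fi.head(3))
# A horizontal bar chart makes the ranking easy to read:
# fi.plot(kind="barh", title="Feature importance")
```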
What the data and model revealed about power consumption patterns.
Consumption oscillates reliably between two bands year after year — a 20–40 GW moderate band and a 40–60 GW peak band driven by summer cooling and winter heating.
The model assigned ~92% relative importance to the hour feature. Daily cycles — morning rise, 6 PM peak, overnight trough — explain the majority of variance in consumption.
Adding year-ago values (lag1, lag2, lag3) improved cross-validated RMSE by giving the model historical anchors — especially valuable for capturing seasonal peak magnitudes.
Per-day error analysis showed worst predictions cluster around holidays — days where human behavior breaks the learned weekly and daily patterns.
End-to-end Python ML pipeline from data ingestion to future forecast generation.