Citi Bike Station Demand & Capacity Planning

Citi Bike operators face a constant rebalancing problem: stations drain empty during rush hours, overflow during off-peak periods, and some locations stay structurally imbalanced regardless of time of day. This project builds an end-to-end ML pipeline that turns a year of raw trip data into three concrete operational outputs: a demand forecast, a risk flag, and a ranked priority list for capacity intervention.

Built for the Machine Learning course at Óbuda University, Budapest · May 2025.

The Problem

Bike-sharing rebalancing is expensive and reactive: trucks move bikes only after stations have already failed. The goal here was to shift from reactive to predictive operation and give operators the tools to act before failure, not after.

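Three tasks, each building on the last:

Task 1, Demand Forecasting: predict average departures 2–6 hours ahead
Task 2, Risk Classification: flag station-hours that are high-risk
Task 3, Capacity Prioritization: rank stations by a composite intervention score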

Data & Scale

The dataset is the full 2025 Citi Bike trip history, pulled from Citi Bike's public S3 bucket.

The pipeline downloads the monthly ZIPs, parses the CSVs with pinned dtypes, writes them to Parquet, and then defines DuckDB SQL views over all files, so 10M+ rows can be queried analytically without loading the full dataset into RAM. That was the key architectural decision that made the project runnable on a laptop.

The train/val/test split is strictly time-based:

Jan 2025 ──────────────────────────── Dec 2025
│      TRAIN (75%)       │  VAL 12.5%  │  TEST 12.5%  │
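
A minimal sketch of a chronological split in these proportions, assuming the cut points sit at 75% and 87.5% of the covered time span (the project may instead cut by row count). The function name time_based_split and the toy hourly rows are illustrative and not taken from the project.

```python
from datetime import datetime, timedelta


def time_based_split(rows, train_frac=0.75, val_frac=0.125):
    """Split (timestamp, value) rows chronologically by time-span cut points."""
    times = [r[0] for r in rows]
    t_min, t_max = min(times), max(times)
    span = t_max - t_min
    train_end = t_min + span * train_frac
    val_end = t_min + span * (train_frac + val_frac)
    # Every training row precedes every validation row, which precedes every test row.
    train = [r for r in rows if r[0] < train_end]
    val = [r for r in rows if train_end <= r[0] < val_end]
    test = [r for r in rows if r[0] >= val_end]
    return train, val, test


# Hypothetical data: 16 hourly observations starting 2025-01-01.
start = datetime(2025, 1, 1)
rows = [(start + timedelta(hours=h), h) for h in range(16)]
train, val, test = time_based_split(rows)
```

Cutting on timestamps rather than shuffling keeps all rows from the same hour in the same split, so the model is never evaluated on hours it has effectively already seen.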

Feature Engineering

Four families of features, all computed with 8-core multiprocessing on per-station groups:

Lag features — dep_lag_1h, 2h, 3h, 6h, 12h, 24h (yesterday), 48h, 168h (last week same hour)
Rolling stats — 3h / 6h / 24h window mean and standard deviation, computed on the series after shift(1) so each window sees only past hours and cannot leak the current value
Cyclical time encoding — hour_sin/cos and dow_sin/cos to preserve the circular structure of time
Flow & capacity signals — net_flow (departures − arrivals), cum_net_flow (cumulative drain), demand_ratio (departures / capacity)
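
The four families above can be sketched without dependencies as follows. This is an illustration for a single station, not the project's code: the real pipeline presumably uses Pandas, and the helper names (lag, rolling_mean_past, cyclical), the feature name dep_roll_mean_3h, and the toy numbers are all assumptions. Only the rolling mean is shown; the standard deviation would follow the same past-only windowing.

```python
import math
from itertools import accumulate


def lag(series, k):
    """Value k steps earlier, or None where there is not enough history."""
    return [series[i - k] if i >= k else None for i in range(len(series))]


def rolling_mean_past(series, window):
    """Mean of the `window` values strictly before each position.

    Equivalent in spirit to shift(1) followed by a rolling mean: the
    current hour never contributes to its own feature.
    """
    out = []
    for i in range(len(series)):
        past = series[max(0, i - window):i]
        out.append(sum(past) / window if len(past) == window else None)
    return out


def cyclical(value, period):
    """Map a periodic value (hour of day, day of week) onto the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


# Hypothetical hourly counts for one station with 20 docks.
departures = [4, 6, 5, 9, 12, 7]
arrivals = [5, 3, 5, 4, 6, 9]
capacity = 20

dep_lag_1h = lag(departures, 1)
dep_roll_mean_3h = rolling_mean_past(departures, 3)
hour_sin, hour_cos = cyclical(23, 24)

net_flow = [d - a for d, a in zip(departures, arrivals)]
cum_net_flow = list(accumulate(net_flow))
demand_ratio = [d / capacity for d in departures]
```

With the sine/cosine pair, hour 23 ends up close to hour 0, which a raw 0–23 integer would place at opposite ends of the scale.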

Task 1 — Demand Forecasting

A LightGBM regressor predicts average departures 2–6 hours ahead. Early stopping on the validation set limited overfitting.

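from lightgbm import LGBMRegressor

model = LGBMRegressor(
    n_estimators=300,       # upper bound on trees; early stopping @ 30
    num_leaves=31,
    learning_rate=0.10,
    subsample=0.5,          # row sampling; LightGBM applies it only if subsample_freq > 0
    colsample_bytree=0.5,   # fraction of features sampled per tree
)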
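
One way to build the regression target described above is to average the departures in the window from 2 to 6 hours after each hour. The sketch below assumes the window is inclusive at both ends (five hourly values) and that hours without a complete future window are dropped; the function name future_mean_target and the sample counts are hypothetical.

```python
def future_mean_target(departures, start=2, end=6):
    """For each hour t, average departures over hours t+start .. t+end (inclusive)."""
    width = end - start + 1
    targets = []
    for t in range(len(departures)):
        window = departures[t + start:t + end + 1]
        # Hours near the end of the series have no complete future window.
        targets.append(sum(window) / width if len(window) == width else None)
    return targets


# Hypothetical hourly departures for one station.
departures = [3, 5, 8, 10, 6, 4, 2, 1, 7, 9]
targets = future_mean_target(departures)
```

Rows whose target is None would be excluded before training, since their outcome lies beyond the end of the data.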

Results:

The near-identical validation and test numbers indicate that the model generalizes to later, unseen data rather than overfitting. Feature importance analysis showed dep_lag_1h and dep_lag_24h as the dominant signals, consistent with highly cyclical demand.

Task 2 — Risk Classification

A station is labeled high-risk if any of these conditions hold in a given hour:

Departures ≥ 75th percentile (near-empty risk)
Demand ratio ≥ 0.80 (near-capacity)
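
The labeling rule built from the conditions listed above can be sketched like this. It assumes the 75th percentile is computed once from training-period departures (whether the project uses a global or a per-station percentile is not stated) and uses linear interpolation; the function names and sample values are hypothetical.

```python
import statistics


def departures_p75(train_departures):
    """75th percentile of departures, with linear interpolation between data points."""
    return statistics.quantiles(train_departures, n=4, method="inclusive")[2]


def is_high_risk(departures, demand_ratio, dep_threshold, ratio_threshold=0.80):
    """Return 1 if either condition holds for this station-hour, else 0."""
    return int(departures >= dep_threshold or demand_ratio >= ratio_threshold)


# Hypothetical training departures and three station-hours to label.
train_departures = [2, 4, 6, 8, 10]
p75 = departures_p75(train_departures)
station_hours = [(9, 0.30), (3, 0.85), (3, 0.30)]  # (departures, demand_ratio)
labels = [is_high_risk(d, r, p75) for d, r in station_hours]
```

Deriving the percentile from the training period only keeps the label definition from depending on validation or test data.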

One critical design choice: net_flow, demand_ratio, and cum_net_flow were excluded from classifier features. They derive directly from the label definition — including them would be leakage.

The classification threshold was tuned on the validation set (final: 0.56), which lifted F1 for the high-risk class (risk=1) by ~4%.
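
Threshold tuning of this kind can be sketched as a simple sweep: score each candidate cutoff by F1 on the validation set and keep the best. The candidate grid, the function names, and the toy labels and probabilities below are assumptions for illustration, not the project's values.

```python
def f1_score(y_true, y_pred):
    """F1 for the positive class (risk=1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def tune_threshold(y_true, probs, candidates):
    """Return the (threshold, F1) pair with the highest F1; ties keep the lowest threshold."""
    best_t, best_f1 = None, -1.0
    for t in candidates:
        preds = [int(p >= t) for p in probs]
        score = f1_score(y_true, preds)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1


# Hypothetical validation labels and predicted risk probabilities.
val_labels = [0, 0, 1, 0, 1, 1, 0, 1]
val_probs = [0.10, 0.52, 0.58, 0.54, 0.70, 0.90, 0.30, 0.57]
candidates = [round(0.30 + 0.01 * i, 2) for i in range(41)]  # 0.30 .. 0.70

best_threshold, best_f1 = tune_threshold(val_labels, val_probs, candidates)
baseline_f1 = f1_score(val_labels, [int(p >= 0.5) for p in val_probs])
```

The chosen threshold should then be frozen and applied unchanged to the test set, so the reported test F1 is not tuned on the data it is measured on.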

Results:

Task 3 — Capacity Prioritization

The final task takes predictions from Tasks 1 and 2 and produces a ranked list of stations that need capacity intervention most urgently.

Each station receives a composite score:

Score = 0.35 × norm(avg_demand)
      + 0.35 × norm(risk_freq)
      + 0.20 × norm(avg_risk_prob)
      + 0.10 × norm(drain_score)
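
A minimal sketch of this scoring step, assuming norm() is min-max scaling across stations (the source does not say which normalization was used). The station ids and metric values are made up; only the weights come from the formula above.

```python
WEIGHTS = {
    "avg_demand": 0.35,
    "risk_freq": 0.35,
    "avg_risk_prob": 0.20,
    "drain_score": 0.10,
}


def min_max(values):
    """Scale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def rank_stations(stations):
    """Return (station_id, score) pairs sorted from highest to lowest score."""
    ids = list(stations)
    scores = {sid: 0.0 for sid in ids}
    for metric, weight in WEIGHTS.items():
        normed = min_max([stations[sid][metric] for sid in ids])
        for sid, value in zip(ids, normed):
            scores[sid] += weight * value
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical per-station aggregates.
stations = {
    "A": {"avg_demand": 30.0, "risk_freq": 0.9, "avg_risk_prob": 0.8, "drain_score": 5.0},
    "B": {"avg_demand": 10.0, "risk_freq": 0.1, "avg_risk_prob": 0.2, "drain_score": 1.0},
    "C": {"avg_demand": 20.0, "risk_freq": 0.5, "avg_risk_prob": 0.5, "drain_score": 3.0},
}
ranking = rank_stations(stations)
```

Because the weights sum to 1 and each normalized term lies in [0, 1], the score also lies in [0, 1], which makes values such as 0.909 directly comparable across stations.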

The output is a station_priority_ranking.csv alongside bar and scatter plots for operator review. Top finding: stations 5788 (score: 0.909) and 5779 (score: 0.837) are the most critical bottlenecks in the system — chronic drain, high predicted demand, and near-constant risk flags.

Key Takeaways

Time patterns dominate. Lag features at 1h, 24h, and 168h carry most of the predictive signal. Demand is highly cyclical — the model essentially learns that tomorrow at 8 AM will look like today at 8 AM.

Imbalance is structural. The highest-risk stations cluster around transit hubs and tourist zones. Their net flow drain persists regardless of time of day: these aren't random fluctuations but structural problems.

RAM-efficient pipelines matter. DuckDB over Parquet means 24 GB of data is queryable on a standard laptop. Combined with multiprocessing for feature engineering, the full pipeline runs without a GPU or cloud compute.

Predictions should become decisions. The composite prioritization score was designed to be operator-readable — a ranked list with a clear formula, not just a model artifact.

Stack

Python · DuckDB · LightGBM · Pandas · NumPy · scikit-learn · Matplotlib · Multiprocessing
