🤖 AI Summary
Existing benchmarks for probabilistic forecasting in power systems do not jointly evaluate physical safety constraints and distributional fidelity. To address this gap, this work introduces PowerPhase, a large-scale benchmark covering six transmission grids with joint forecasts of up to 36,964 channels, in which every target trajectory is the output of an AC power-flow solve. It also proposes PowerForge, a scenario-based quantile forecaster with type-specific decoder heads and a causal bridge between variable groups. The benchmark evaluates safety and fidelity together, adding constraint-aware metrics such as Safety_mBrier, and shows that the two criteria rank models differently, a trade-off the authors term safety-fidelity. Compared against eight baselines, PowerForge achieves the best average rank on every grid.
📝 Abstract
Probabilistic forecasting models are increasingly deployed on multivariate systems with distinct channel physics and operational constraints, but existing benchmarks evaluate neither property at scale. Public canonical multivariate benchmarks cap out at 2,000 channels, while power-system benchmarks lack either temporal structure or probabilistic evaluation. We introduce PowerPhase, a probabilistic forecasting benchmark built on six transmission grids ranging from 2,000 to 36,964 jointly forecasted channels, more than an order of magnitude beyond popular canonical multivariate benchmarks. Each target trajectory is the output of an AC power-flow solve, and PowerPhase ships with constraint-aware metrics, including Safety_mBrier, NECV, and CVaR-alpha, that complement CRPS and Distortion. Across eight baselines and three seeds, distributional accuracy and constraint satisfaction rank models differently, a trade-off we term safety-fidelity. We further propose PowerForge, a scenario-based quantile forecaster with type-specific decoding heads and a causal bridge between variable groups, which achieves the best average rank on every grid.
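To make two of the named metrics concrete, the following is a minimal sketch of the textbook sample-based CRPS estimator and an empirical CVaR-alpha. It is not PowerPhase's implementation: the abstract does not give the benchmark's exact definitions, so the function names `crps_ensemble` and `cvar`, the choice of alpha, and the idea of feeding per-scenario constraint-violation magnitudes into `cvar` are all assumptions made for illustration. Safety_mBrier, NECV, and Distortion are not sketched because the abstract does not define them.

```python
import numpy as np


def crps_ensemble(samples, y):
    """Sample-based CRPS estimate for one scalar target.

    Uses the standard identity CRPS = E|X - y| - 0.5 * E|X - X'|,
    with both expectations replaced by averages over the samples.
    """
    samples = np.asarray(samples, dtype=float)
    # Average distance between forecast samples and the observation.
    term_obs = np.mean(np.abs(samples - y))
    # Average pairwise distance between forecast samples (spread term).
    term_spread = np.mean(np.abs(samples[:, None] - samples[None, :]))
    return term_obs - 0.5 * term_spread


def cvar(losses, alpha=0.95):
    """Empirical CVaR-alpha: mean of the losses at or above the alpha-quantile.

    `losses` is assumed to hold one loss per scenario, for example a
    hypothetical constraint-violation magnitude.
    """
    losses = np.asarray(losses, dtype=float)
    var = np.quantile(losses, alpha)  # empirical value-at-risk
    return losses[losses >= var].mean()


if __name__ == "__main__":
    # Two equally likely scenarios around an observed value of 1.0.
    print(crps_ensemble([0.0, 2.0], 1.0))
    # Tail average of ten hypothetical per-scenario violation magnitudes.
    print(cvar(np.arange(10.0), alpha=0.8))
```

A lower CRPS indicates a sharper, better-calibrated forecast distribution, while CVaR-alpha summarizes only the worst tail of the scenarios, which is one way a model can rank well on fidelity yet poorly on safety.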