🤖 AI Summary
Existing approaches to predicting the energy consumption of large language model (LLM) inference on multi-GPU systems either rely on costly profiling of production-level code or fail to capture multi-GPU energy behavior. This work proposes EnergyLens, an end-to-end framework that combines three components: an einsum-based interface for specifying the model architecture (including fusion, parallelism, and compute-communication overlap), load-imbalance-aware energy modeling for Mixture-of-Experts (MoE), and an empirically driven communication energy model for multi-GPU setups. The framework enables energy prediction and exploration of energy-efficient configurations without requiring full deployment. Evaluated on Llama3 and Qwen3-MoE across tensor-parallel and expert-parallel configurations, it achieves mean absolute percentage errors (MAPEs) of 9.25%–13.19% for multi-GPU prefill and decode energy, reveals up to 1.47× (prefill) and 52.9× (decode) variation in energy efficiency across configurations, and correctly identifies Pareto-optimal compute-communication overlap configurations.
📝 Abstract
We present EnergyLens, an end-to-end framework for energy-aware large language model (LLM) inference optimization. As LLMs scale, predicting and reducing their energy footprint has become critical for sustainability and datacenter operations, yet existing approaches either require production-level code and expensive profiling or fail to accurately capture multi-GPU energy behavior. As a result, practitioners lack tools for deciding which optimizations to prioritize and for selecting among existing deployment configurations when exhaustive profiling is impractical. EnergyLens addresses this gap with an intuitive einsum-based interface that captures LLM specifications including fusion, parallelism, and compute-communication overlap, combined with load-imbalance-aware MoE modeling and an empirically driven communication energy model for multi-GPU settings. We validate EnergyLens on Llama3 and Qwen3-MoE across tensor-parallel and expert-parallel configurations, achieving mean absolute percentage errors (MAPEs) between 9.25% and 13.19% for multi-GPU prefill and decode energy, and a MAPE of 12.97% across streaming multiprocessor (SM) allocations for Megatron-style overlap. Our energy-driven exploration reveals up to 1.47x and 52.9x variation in energy efficiency across configurations for prefill and decode, respectively, and motivates distributed serving. We further show that compute-communication overlap is difficult to optimize with intuition alone, but EnergyLens correctly identifies Pareto-optimal overlap configurations.
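For readers unfamiliar with the accuracy metric quoted above, the following is a minimal sketch of how a mean absolute percentage error is computed from paired measured and predicted energy values. The function name `mape` and the energy figures are hypothetical and chosen only for illustration; they are not taken from the paper or from EnergyLens itself.

```python
def mape(measured, predicted):
    """Return the mean absolute percentage error, in percent.

    Both arguments are equal-length sequences of energy values
    (for example, joules per request). Measured values must be nonzero.
    """
    if len(measured) != len(predicted) or len(measured) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Relative error of each prediction against its measurement.
    relative_errors = [abs(p - m) / abs(m) for m, p in zip(measured, predicted)]
    return 100.0 * sum(relative_errors) / len(relative_errors)


# Hypothetical energy values in joules (illustrative only).
measured_joules = [100.0, 200.0, 400.0]
predicted_joules = [110.0, 180.0, 400.0]

print(f"MAPE = {mape(measured_joules, predicted_joules):.2f}%")  # prints "MAPE = 6.67%"
```

Here the per-configuration errors are 10%, 10%, and 0%, so the mean is about 6.67%. A MAPE in the 9.25% to 13.19% range therefore means that, averaged over the evaluated configurations, predicted energy deviates from measured energy by roughly a tenth.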