🤖 AI Summary
Energy efficiency of code generated by large language models (LLMs) remains largely unexamined despite their growing deployment in software development. Method: We introduce the first reproducible, hardware-level energy-efficiency benchmark for LLM-generated code, evaluating 20 mainstream models on 878 LeetCode problems under standardized execution environments with fine-grained power monitoring. Contribution/Results: Our analysis reveals substantial energy inefficiency: Grok-2’s code consumes up to 450× more energy than human-optimal solutions on dynamic programming tasks; human reference implementations are 17–21% more energy-efficient than code from DeepSeek-v3 and GPT-4o, and over 100% more efficient than code from Grok-2 and Gemini-1.5-Pro. We identify coupled effects of model architecture and algorithmic problem class on energy consumption and establish consistent cross-model energy-efficiency rankings. This work provides foundational empirical evidence and a rigorous methodology for green AI development and evaluation.
📝 Abstract
As the quality of code generated by Large Language Models (LLMs) improves, their adoption in the software industry for automated code generation continues to grow. Researchers have primarily focused on enhancing the functional correctness of the generated code while commonly overlooking its energy efficiency and environmental impact. This paper investigates the energy efficiency of code generated by 20 popular LLMs for 878 programming problems of varying difficulty levels and diverse algorithmic categories selected from the LeetCode platform, comparing the generated solutions against canonical human-written ones. Although LLMs can produce functionally correct code in most cases, our findings show that the performance and energy efficiency of LLM-produced solutions are often far below those of human-written solutions. Among the studied LLMs, DeepSeek-v3 and GPT-4o generate the most energy-efficient code, whereas Grok-2 and Gemini-1.5-Pro are among the least energy-efficient. On average, canonical human-written solutions are approximately 1.17 times more energy-efficient than DeepSeek-v3's code, 1.21 times more energy-efficient than GPT-4o's, and over 2 times more energy-efficient than that of Grok-2 and Gemini-1.5-Pro. For specific algorithmic categories such as dynamic programming, backtracking, and bit manipulation, LLM-generated code can consume up to 450 times more energy than the canonical human-written solutions.
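To make the comparison methodology concrete, the sketch below shows one plausible way to measure per-run energy of two functionally equivalent solutions and compute an efficiency ratio. It is an illustration, not the paper's actual instrumentation: `read_energy_uj` is a hypothetical stand-in for a cumulative hardware energy counter (on Linux, Intel RAPL exposes one at `/sys/class/powercap/intel-rapl:0/energy_uj`), approximated here by a constant-power model so the example is self-contained. The two Fibonacci solutions mirror the dynamic-programming case from the abstract, where naive generated code can be orders of magnitude less efficient than the human-optimal iterative version.

```python
# Hedged sketch: comparing the energy efficiency of two solutions to the
# same problem. read_energy_uj() is a hypothetical stand-in for reading a
# cumulative hardware energy counter (e.g., Intel RAPL's energy_uj file);
# here it models a constant 10 W draw so the example runs anywhere.
import time

def read_energy_uj() -> int:
    """Cumulative energy counter in microjoules (assumed 10 W draw)."""
    return int(time.perf_counter() * 10 * 1_000_000)

def measure_energy(fn, *args, repeats=5):
    """Run fn `repeats` times; return mean energy per run in joules."""
    start = read_energy_uj()
    for _ in range(repeats):
        fn(*args)
    end = read_energy_uj()
    return (end - start) / repeats / 1_000_000

# Naive exponential recursion (typical of inefficient generated code on
# dynamic-programming tasks) vs. an iterative O(n) DP solution.
def fib_naive(n):
    return n if n < 2 else fib_naive(n - 1) + fib_naive(n - 2)

def fib_dp(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

e_naive = measure_energy(fib_naive, 26)
e_dp = measure_energy(fib_dp, 26)
print(f"naive/dp energy ratio ≈ {e_naive / e_dp:.0f}x")
```

Under a constant-power assumption the energy ratio reduces to a runtime ratio; with a real counter, differences in power draw (e.g., memory-bound vs. compute-bound code) would also surface, which is why hardware-level monitoring matters.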