Evaluating the Energy-Efficiency of the Code Generated by LLMs

📅 2025-05-23
🤖 AI Summary
Energy efficiency of code generated by large language models (LLMs) remains largely unexamined despite their growing deployment in software development. Method: We introduce the first reproducible, hardware-level energy-efficiency benchmark for LLM-generated code, evaluating 20 mainstream models on 878 LeetCode problems under standardized execution environments with fine-grained power monitoring. Contribution/Results: Our analysis reveals substantial energy inefficiency: for algorithmic categories such as dynamic programming, backtracking, and bit manipulation, LLM-generated code can consume up to 450× more energy than human canonical solutions; on average, human reference implementations are 17% and 21% more energy-efficient than DeepSeek-v3 and GPT-4o, respectively, and over 100% more efficient than Grok-2 and Gemini-1.5-Pro. We identify coupled effects of model architecture and algorithmic problem class on energy consumption and establish consistent cross-model energy-efficiency rankings. This work provides foundational empirical evidence and a rigorous methodology for green AI development and evaluation.

📝 Abstract
As the quality of code generated by Large Language Models (LLMs) improves, their adoption in the software industry for automated code generation continues to grow. Researchers primarily focus on enhancing the functional correctness of the generated code while commonly overlooking its energy efficiency and environmental impact. This paper investigates the energy efficiency of the code generated by 20 popular LLMs for 878 programming problems of varying difficulty levels and diverse algorithmic categories selected from the LeetCode platform by comparing them against canonical human-written solutions. Although LLMs can produce functionally correct results in most cases, our findings show that the performance and energy efficiency of LLM-produced solutions are often far below those of human-written solutions. Among the studied LLMs, DeepSeek-v3 and GPT-4o generate the most energy-efficient code, whereas Grok-2 and Gemini-1.5-Pro are among the least energy-efficient models. On average, human-generated canonical solutions are approximately 1.17 times more energy efficient than DeepSeek-v3, 1.21 times more energy efficient than GPT-4o, and over 2 times more energy efficient than Grok-2 and Gemini-1.5-Pro. For specific algorithmic groups such as dynamic programming, backtracking, and bit manipulation, LLM-generated code can consume up to 450 times more energy than human-generated canonical solutions.
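The reported multipliers can be read as a simple energy ratio: if a human canonical solution consumes E_h joules on a problem and an LLM solution consumes E_l joules, the human code is E_l / E_h times more energy-efficient. A minimal sketch with made-up measurements (the numbers below are illustrative, not taken from the paper's data):

```python
def efficiency_ratio(llm_joules: float, human_joules: float) -> float:
    """How many times more energy-efficient the human solution is
    than the LLM solution on the same problem."""
    return llm_joules / human_joules

# Illustrative, hypothetical per-problem energy measurements (joules).
human = 0.50
deepseek = 0.585   # ~1.17x the human energy, matching the average reported gap
grok2 = 225.0      # an extreme case, e.g. a dynamic-programming problem

print(efficiency_ratio(deepseek, human))  # 1.17 -> "1.17 times more energy efficient"
print(efficiency_ratio(grok2, human))     # 450.0 -> the worst-case factor reported
```

A ratio of 2.0 corresponds to the "over 2 times" gap reported for Grok-2 and Gemini-1.5-Pro on average, while 450 is the worst per-category case.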
Problem

Research questions and friction points this paper is trying to address.

Evaluating energy efficiency of LLM-generated code
Comparing LLM code to human solutions
Identifying performance gaps in energy consumption
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates energy efficiency of LLM-generated code
Compares 20 LLMs against human-written solutions
Identifies DeepSeek-v3 and GPT-4o as the most energy-efficient models
Authors

Md Arman Islam — University at Buffalo, Buffalo, NY, USA
Devi Varaprasad Jonnala — University at Buffalo, Buffalo, NY, USA
Ritika Rekhi — University at Buffalo, Buffalo, NY, USA
Pratik Pokharel — University at Buffalo, Buffalo, NY, USA
Siddharth Cilamkoti — University at Buffalo, Buffalo, NY, USA
Asif Imran — University at Buffalo (software engineering, software security, machine learning)
Tevfik Kosar — Professor, University at Buffalo (SUNY) (distributed systems, green and sustainable computing, AI/ML for systems)
Bekir Turkkan — IBM Research, Yorktown Heights, NY, USA