🤖 AI Summary
Energy efficiency of code generated by large language models (LLMs) remains largely unexamined despite their growing deployment in software development. Method: We introduce the first reproducible, hardware-level energy-efficiency benchmark for LLM-generated code, evaluating 20 mainstream models on 878 LeetCode problems under standardized execution environments with fine-grained power monitoring. Contribution/Results: Our analysis reveals substantial energy inefficiency: Grok-2’s code consumes up to 450× more energy than human-optimal solutions on dynamic programming tasks; human reference implementations are 17–21% more energy-efficient than code from DeepSeek-v3 and GPT-4o, and over 100% more efficient than code from Grok-2 and Gemini-1.5-Pro. We identify coupled effects of model architecture and algorithmic problem class on energy consumption and establish consistent cross-model energy-efficiency rankings. This work provides foundational empirical evidence and a rigorous methodology for green AI development and evaluation.
📝 Abstract
As the quality of code generated by Large Language Models (LLMs) improves, their adoption in the software industry for automated code generation continues to grow. Researchers have primarily focused on enhancing the functional correctness of the generated code while commonly overlooking its energy efficiency and environmental impact. This paper investigates the energy efficiency of code generated by 20 popular LLMs for 878 programming problems of varying difficulty levels and diverse algorithmic categories selected from the LeetCode platform, comparing the generated solutions against canonical human-written ones. Although LLMs can produce functionally correct code in most cases, our findings show that the performance and energy efficiency of LLM-produced solutions are often far below those of human-written solutions. Among the studied LLMs, DeepSeek-v3 and GPT-4o generate the most energy-efficient code, whereas Grok-2 and Gemini-1.5-Pro are among the least energy-efficient. On average, canonical human-written solutions are approximately 1.17 times more energy-efficient than DeepSeek-v3's code, 1.21 times more energy-efficient than GPT-4o's, and over 2 times more energy-efficient than that of Grok-2 and Gemini-1.5-Pro. For specific algorithmic categories such as dynamic programming, backtracking, and bit manipulation, LLM-generated code can consume up to 450 times more energy than the canonical human-written solutions.
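To make the comparison methodology concrete, the sketch below shows one plausible way to measure per-run energy of two functionally equivalent solutions and compute an efficiency ratio. It is an illustration, not the paper's actual instrumentation: `read_energy_uj` is a hypothetical stand-in for a cumulative hardware energy counter (on Linux, Intel RAPL exposes one at `/sys/class/powercap/intel-rapl:0/energy_uj`), approximated here by a constant-power model so the example is self-contained. The two Fibonacci solutions mirror the dynamic-programming case from the abstract, where naive generated code can be orders of magnitude less efficient than the human-optimal iterative version.

```python
# Hedged sketch: comparing the energy efficiency of two solutions to the
# same problem. read_energy_uj() is a hypothetical stand-in for reading a
# cumulative hardware energy counter (e.g., Intel RAPL's energy_uj file);
# here it models a constant 10 W draw so the example runs anywhere.
import time

def read_energy_uj() -> int:
    """Cumulative energy counter in microjoules (assumed 10 W draw)."""
    return int(time.perf_counter() * 10 * 1_000_000)

def measure_energy(fn, *args, repeats=5):
    """Run fn `repeats` times; return mean energy per run in joules."""
    start = read_energy_uj()
    for _ in range(repeats):
        fn(*args)
    end = read_energy_uj()
    return (end - start) / repeats / 1_000_000

# Naive exponential recursion (typical of inefficient generated code on
# dynamic-programming tasks) vs. an iterative O(n) DP solution.
def fib_naive(n):
    return n if n < 2 else fib_naive(n - 1) + fib_naive(n - 2)

def fib_dp(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

e_naive = measure_energy(fib_naive, 26)
e_dp = measure_energy(fib_dp, 26)
print(f"naive/dp energy ratio ≈ {e_naive / e_dp:.0f}x")
```

Under a constant-power assumption the energy ratio reduces to a runtime ratio; with a real counter, differences in power draw (e.g., memory-bound vs. compute-bound code) would also surface, which is why hardware-level monitoring matters.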