🤖 AI Summary
This study systematically evaluates the code generation capabilities of mainstream large language models (LLMs) on LeetCode programming problems, focusing on fundamental limitations in algorithmic reasoning, edge-case handling, and time-complexity optimization.
Method: Leveraging a crawled real-world problem corpus, we invoke APIs of models including GPT-4 and GPT-3.5-turbo, and perform automated execution and testing. Crucially, we introduce the first joint evaluation framework combining pass@k accuracy with empirical runtime measurements.
Contribution/Results: Results reveal that GPT-4 achieves 62.3% pass@1 on medium-difficulty problems but drops sharply to 28.1% on hard ones. Over 40% of failures stem from logical flaws rather than syntactic errors, and 73% of generated solutions fail to achieve optimal asymptotic time complexity. These findings expose structural deficiencies in LLMs' algorithmic reasoning, establish an empirically grounded evaluation paradigm for programming-assistant tools, and inform concrete directions for model improvement.
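The pass@k figures above are typically computed with the standard unbiased estimator from the Codex evaluation literature: given n generated samples per problem of which c pass all tests, the probability that at least one of k drawn samples is correct is 1 − C(n−c, k)/C(n, k). A minimal sketch (assuming this is the estimator used; the paper summary does not spell out the formula):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total samples generated for a problem
    c: number of samples that pass all test cases
    k: budget of attempts being evaluated
    Returns the probability that at least one of k samples is correct.
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # must include at least one correct solution.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

For example, with 10 samples of which 5 pass, `pass_at_k(10, 5, 1)` gives 0.5, and the estimate rises toward 1.0 as k grows.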
📝 Abstract
This paper presents a comprehensive performance evaluation of Large Language Models (LLMs) in solving programming challenges from LeetCode, a widely used platform for algorithm practice and technical interviews. We began by crawling the LeetCode website to collect a diverse set of problems spanning various difficulty levels and topics. Using this dataset, we generated solutions with multiple LLMs, including GPT-4 and GPT-3.5-turbo (the model underlying ChatGPT). The generated solutions were systematically evaluated for correctness and efficiency. We employed the pass@k metric to assess success rates within a given number of attempts and analyzed the runtime performance of the solutions. Our results highlight the strengths and limitations of current LLMs [10] in code generation and problem-solving tasks, providing insights into their potential applications and areas for improvement in automated programming assistance.
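The correctness-and-efficiency evaluation described above can be sketched as a small harness that runs each candidate solution against test cases and records both pass/fail status and wall-clock runtime. This is an illustrative sketch, not the paper's actual pipeline; the function names, the dictionary schema, and the `two_sum` candidate are all assumptions for demonstration:

```python
import time

def run_with_timing(solution, test_cases):
    """Execute a candidate solution against (args, expected) pairs,
    recording correctness and wall-clock runtime per case.
    Illustrative only -- the paper's harness is not specified here."""
    results = []
    for args, expected in test_cases:
        start = time.perf_counter()
        try:
            output = solution(*args)
            passed = (output == expected)
        except Exception:
            # Runtime errors count as failures, mirroring automated judges.
            passed = False
        elapsed = time.perf_counter() - start
        results.append({"pass": passed, "runtime_s": elapsed})
    return results

def two_sum(nums, target):
    """Sample model-generated candidate: hash-map approach, O(n) time."""
    seen = {}
    for i, x in enumerate(nums):
        if target - x in seen:
            return [seen[target - x], i]
        seen[x] = i

cases = [(([2, 7, 11, 15], 9), [0, 1]), (([3, 2, 4], 6), [1, 2])]
report = run_with_timing(two_sum, cases)
```

Aggregating `pass` flags across sampled generations feeds the pass@k metric, while the recorded runtimes support the empirical efficiency comparison.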