🤖 AI Summary
This study systematically evaluates the competitive programming capabilities of ChatGPT o3-mini and DeepSeek-R1 on 29 Codeforces problems spanning easy, medium, and hard difficulty levels. To address the lack of standardized, execution-based evaluation, we propose the first automated benchmarking framework grounded in real test cases, enabling unified quantification across three dimensions: pass rate, memory consumption, and runtime performance. Our analysis reveals a previously undocumented difficulty-sensitivity disparity: ChatGPT achieves a 54.5% pass rate on medium-difficulty problems, significantly outperforming DeepSeek-R1 (18.1%), while both models perform comparably on easy problems and converge below 8% on hard problems, exposing a shared limitation in complex algorithmic reasoning. This work establishes a reproducible, multi-dimensional evaluation benchmark for assessing LLMs' programming proficiency, advancing rigorous, execution-aware assessment methodologies in code generation research.
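To make the execution-based setup concrete, the sketch below shows one way such a harness might run a generated solution against a real test case and record a verdict together with runtime and a coarse memory figure. It is an illustrative approximation rather than the authors' actual framework; the time limit, memory limit, Unix-only `resource` usage, and the assumption that solutions are Python scripts are all placeholders.

```python
# Illustrative sketch of an execution-based judge (not the authors' framework).
# Assumes Unix, Python solutions, and placeholder time/memory limits.
import resource
import subprocess
import time

TIME_LIMIT_S = 2.0      # assumed per-test time limit
MEMORY_LIMIT_MB = 256   # assumed per-test memory limit


def run_test(solution_path: str, stdin_text: str, expected: str) -> dict:
    """Run one test case and report verdict, wall-clock time, and a memory proxy."""
    start = time.perf_counter()
    try:
        proc = subprocess.run(
            ["python3", solution_path],
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=TIME_LIMIT_S,
        )
    except subprocess.TimeoutExpired:
        return {"verdict": "TLE", "time_s": TIME_LIMIT_S, "mem_mb": None}
    elapsed = time.perf_counter() - start

    # Peak RSS over all child processes run so far (KiB on Linux);
    # only a coarse proxy for the memory of this particular run.
    mem_mb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss / 1024

    if proc.returncode != 0:
        verdict = "RE"
    elif proc.stdout.strip() != expected.strip():
        verdict = "WA"
    elif mem_mb > MEMORY_LIMIT_MB:
        verdict = "MLE"
    else:
        verdict = "AC"
    return {"verdict": verdict, "time_s": elapsed, "mem_mb": mem_mb}
```

A real judge would sandbox the process and enforce the memory limit at the OS level (e.g. via `resource.setrlimit` or cgroups); the point here is only the pass/memory/runtime bookkeeping that the three evaluation dimensions require.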
📝 Abstract
The advancement of large language models (LLMs) has created a competitive landscape for AI-assisted programming tools. This study evaluates two leading models, ChatGPT o3-mini and DeepSeek-R1, on their ability to solve competitive programming tasks from Codeforces. Using 29 programming tasks across three difficulty levels (easy, medium, and hard), we assessed both models on accepted solutions, memory efficiency, and runtime performance. Our results indicate that while both models perform similarly on easy tasks, ChatGPT outperforms DeepSeek-R1 on medium-difficulty tasks, achieving a 54.5% success rate compared to DeepSeek-R1's 18.1%. Both models struggled with hard tasks, highlighting the ongoing challenges LLMs face in handling highly complex programming problems. These findings reveal key differences in model capabilities and computational efficiency, offering valuable insights for developers and researchers working to advance AI-driven programming tools.
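As a complementary illustration of how per-problem outcomes could be rolled up into the per-difficulty success rates reported above, the short sketch below aggregates hypothetical verdict records; the problem IDs and record layout are invented for the example.

```python
# Illustrative aggregation of per-problem verdicts into per-difficulty pass rates.
# The records below are hypothetical, not data from the study.
from collections import defaultdict

results = [
    ("1900A", "easy", "AC"),
    ("1900B", "medium", "WA"),
    ("1900C", "medium", "AC"),
    ("1900D", "hard", "TLE"),
]


def pass_rate_by_difficulty(records):
    """Return the fraction of accepted (AC) solutions per difficulty level."""
    totals, accepted = defaultdict(int), defaultdict(int)
    for _, difficulty, verdict in records:
        totals[difficulty] += 1
        accepted[difficulty] += (verdict == "AC")
    return {d: accepted[d] / totals[d] for d in totals}


print(pass_rate_by_difficulty(results))
# e.g. {'easy': 1.0, 'medium': 0.5, 'hard': 0.0}
```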