A Showdown of ChatGPT vs DeepSeek in Solving Programming Tasks

📅 2025-03-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study systematically evaluates the competitive programming capabilities of ChatGPT o3-mini and DeepSeek-R1 on 29 Codeforces problems spanning easy, medium, and hard difficulty levels. To address the lack of standardized, execution-based evaluation, we propose the first automated benchmarking framework grounded in real test cases, enabling unified quantification across three dimensions: pass rate, memory consumption, and runtime performance. Our analysis reveals a previously undocumented difficulty-sensitivity disparity: ChatGPT achieves a 54.5% pass rate on medium-difficulty problems—significantly outperforming DeepSeek-R1 (18.1%)—while both models perform comparably on easy problems and converge below 8% on hard problems, exposing a shared limitation in complex algorithmic reasoning. This work establishes a reproducible, multi-dimensional evaluation benchmark for assessing LLMs' programming proficiency, advancing rigorous, execution-aware assessment methodologies in code generation research.
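The execution-based evaluation the summary describes can be sketched as a small harness that runs a candidate solution against real test cases while recording the three dimensions the paper measures: pass rate, runtime, and peak memory. This is an illustrative sketch, not the authors' framework; the function names (`evaluate`, `candidate_sum`) and toy test cases are assumptions for demonstration.

```python
import time
import tracemalloc

def evaluate(solution, test_cases):
    """Run `solution` on each (args, expected) pair, recording pass/fail,
    wall-clock runtime, and peak memory per test case."""
    results = []
    for args, expected in test_cases:
        tracemalloc.start()
        start = time.perf_counter()
        try:
            passed = solution(*args) == expected
        except Exception:
            passed = False  # a crashing solution counts as a failed test
        runtime = time.perf_counter() - start
        _, peak_mem = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        results.append({"passed": passed,
                        "runtime_s": runtime,
                        "peak_bytes": peak_mem})
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return pass_rate, results

# Example: a toy "model-generated solution" checked against
# Codeforces-style input/expected-output pairs.
def candidate_sum(a, b):
    return a + b

tests = [((1, 2), 3), ((5, 7), 12), ((0, 0), 0)]
rate, details = evaluate(candidate_sum, tests)
print(f"pass rate: {rate:.1%}")  # → pass rate: 100.0%
```

A production harness would instead execute each generated program in a sandboxed subprocess with time and memory limits, as Codeforces judges do, but the aggregation into a per-difficulty pass rate is the same.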

📝 Abstract
The advancement of large language models (LLMs) has created a competitive landscape for AI-assisted programming tools. This study evaluates two leading models, ChatGPT o3-mini and DeepSeek-R1, on their ability to solve competitive programming tasks from Codeforces. Using 29 programming tasks spanning three difficulty levels (easy, medium, and hard), we assessed both models by their accepted solutions, memory efficiency, and runtime performance. Our results indicate that while both models perform similarly on easy tasks, ChatGPT outperforms DeepSeek-R1 on medium-difficulty tasks, achieving a 54.5% success rate compared to DeepSeek-R1's 18.1%. Both models struggled with hard tasks, highlighting the ongoing challenges LLMs face in handling highly complex programming problems. These findings reveal key differences in model capability and computational efficiency, offering valuable insights for developers and researchers working to advance AI-driven programming tools.
Problem

Research questions and friction points this paper is trying to address.

Evaluates ChatGPT and DeepSeek on programming task performance.
Compares success rates, memory efficiency, and runtime performance.
Identifies challenges in solving highly complex programming tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes an automated, execution-based benchmarking framework grounded in real Codeforces test cases
Quantifies models along three dimensions: pass rate, memory consumption, and runtime
Reveals a difficulty-sensitivity disparity between ChatGPT and DeepSeek-R1
Ronas Shakya
University of Bergen
Farhad Vadiee
University of Bergen
Mohammad Khalil
Centre for the Science of Learning & Technology (SLATE), University of Bergen, Bergen, Norway