🤖 AI Summary
Existing programming contest benchmarks suffer from insufficient problem difficulty, inadequate coverage, and low-quality test cases, leading to severe evaluation bias when assessing large language models (LLMs). Method: We introduce AetherCode, the first high-fidelity benchmark tailored to elite competitions (e.g., IOI, ICPC), comprising 1,200+ manually curated and annotated challenging problems. We further propose a human-in-the-loop test-case generation framework that integrates automated construction with expert validation to ensure functional completeness and boundary-case robustness. Contribution/Results: AetherCode substantially widens the measured performance gap between LLMs and top human competitors, with mainstream models averaging below 15% pass@1 accuracy. It systematically exposes fundamental limitations of LLMs in complex algorithmic reasoning, multi-step logical composition, and error recovery. By establishing a more rigorous and trustworthy evaluation standard, AetherCode advances research in code generation and reasoning.
📝 Abstract
Competitive programming has emerged as a critical benchmark for evaluating the reasoning and coding capabilities of Large Language Models (LLMs). Despite impressive progress on existing benchmarks, we argue that current evaluations overstate model proficiency, masking a substantial gap between LLMs and elite human programmers. This gap arises from two key limitations: insufficient difficulty and scope of benchmark problems, and evaluation bias from low-quality test cases. To address these shortcomings, we present AetherCode, a new benchmark that draws problems from premier programming competitions such as IOI and ICPC, offering broader coverage and higher difficulty. AetherCode further incorporates comprehensive, expert-validated test suites built through a hybrid of automated generation and human curation, ensuring rigorous and reliable assessment. By combining challenging problem design with robust evaluation, AetherCode provides a more faithful measure of LLM capabilities and sets a new standard for future research in code reasoning.
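The pass@1 figure reported in the summary is conventionally computed with the unbiased pass@k estimator of Chen et al. (2021), averaged over problems. The paper excerpt above does not specify its exact evaluation code, so the following is a minimal illustrative sketch of that standard estimator; the sample `(n, c)` data is hypothetical.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated for a problem
    c: samples that pass every test case
    k: evaluation budget
    Returns the probability that at least one of k randomly
    drawn samples (without replacement) is correct.
    """
    if n - c < k:
        # Fewer failing samples than the budget: success is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Benchmark-level pass@1 is the mean over problems.
# Hypothetical per-problem (n, c) counts for illustration:
results = [(10, 1), (10, 0), (10, 3)]
score = sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
print(round(score, 3))  # → 0.133
```

For k = 1 the estimator reduces to c/n per problem, so the benchmark score is simply the mean fraction of passing samples; larger k rewards models that succeed within a few attempts.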