ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics

📅 2026-06-09

🤖 AI Summary

This study systematically evaluates how well large language models handle rigorous proof reasoning and explicit construction on Olympiad-level combinatorics problems. To this end, the authors introduce ComBench, a benchmark of 100 human-annotated competition problems divided into analysis-centric tasks, which mainly require rigorous arguments, and construction-centric tasks, which require an explicit construction together with a justification of its correctness. A unified evaluation protocol combines rubric-guided proof grading with deterministic verification of constructions, and results are reported both as an average over samples and under a Best@4 multi-sample setting. The strongest model scores 65.4% overall on average (75.3% under Best@4), and performance diverges markedly between the two task types, revealing current limitations in creative mathematical reasoning, particularly on existence and construction problems. The work separates and jointly evaluates these two capabilities, offering a new benchmark and diagnostic framework for mathematical reasoning research.
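The difference between the average score and Best@4 can be illustrated with a small sketch. The summary does not spell out the aggregation, so this assumes four sampled solutions per problem, each scored in [0, 1], with Avg taken as the mean over samples and Best@4 as the maximum over samples, both then averaged across problems. The function name `aggregate` and the example scores are hypothetical.

```python
from statistics import mean

def aggregate(scores_per_problem):
    """Return (avg, best_at_k) for a list of per-problem sample scores.

    Assumption: each inner list holds the k sample scores (in [0, 1])
    for one problem. Avg averages the samples; Best@k keeps the maximum.
    """
    avg = mean(mean(samples) for samples in scores_per_problem)
    best_at_k = mean(max(samples) for samples in scores_per_problem)
    return avg, best_at_k

# Hypothetical scores for three problems, four samples each.
scores = [
    [1.0, 0.5, 0.0, 1.0],
    [0.0, 0.0, 0.5, 0.0],
    [1.0, 1.0, 1.0, 1.0],
]
avg, best_at_4 = aggregate(scores)
print(f"Avg: {avg:.3f}, Best@4: {best_at_4:.3f}")
```

Under these assumptions Best@4 can never fall below Avg, which is consistent with the gap between the two reported figures.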

📝 Abstract

Combinatorics is central to Olympiad-level mathematical problem solving, requiring deep discrete reasoning, creative constructions, and rigorous structural insight. Recent evidence suggests that even today's strongest frontier models remain uneven on Olympiad combinatorics, revealing a gap in creative mathematical reasoning. We introduce ComBench, an Olympiad-level combinatorics benchmark for evaluating and diagnosing the combinatorial reasoning capabilities of large language models. ComBench contains 100 human-annotated competition-level problems organized around two complementary settings: analysis-centric problems, which primarily require rigorous mathematical arguments, and construction-centric problems, which require explicit constructions in addition to correctness justifications. The evaluation protocol combines rubric-guided proof grading with deterministic construction verification, exposing cases where proof quality and construction validity diverge. Experiments on frontier open- and closed-source models show that ComBench is far from saturated: the strongest model reaches 65.4% overall Avg. and 75.3% overall Best@4. We further find that Rigorous Proof Reasoning and Constructive Realization are distinct capabilities: Kimi-K2.6 trails GPT-5.5 on analysis-centric proof grading but surpasses it on construction-centric Best@4, while Existence and Construction problems remain consistently hardest across representative frontier models.
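To make "deterministic construction verification" concrete, here is a minimal sketch of what such a checker might look like for one illustrative task. The task (a subset of {1, ..., n} with no three-term arithmetic progression), the function name `is_valid_construction`, and its parameters are assumptions chosen for illustration; they are not taken from ComBench.

```python
from itertools import combinations

def is_valid_construction(subset, n, min_size):
    """Check a claimed subset of {1..n} with no 3-term arithmetic progression.

    Hypothetical verifier: the checks are purely mechanical, so the same
    input always yields the same verdict, independent of any proof text.
    """
    elements = set(subset)
    if len(elements) != len(subset):
        return False  # repeated elements are not allowed
    if len(elements) < min_size:
        return False  # construction is smaller than claimed
    if any(not (1 <= x <= n) for x in elements):
        return False  # element outside the ground set
    # a < b < c is an arithmetic progression exactly when b = (a + c) / 2.
    for a, c in combinations(sorted(elements), 2):
        if (a + c) % 2 == 0 and (a + c) // 2 in elements:
            return False
    return True

print(is_valid_construction([1, 2, 4, 5], n=5, min_size=4))
print(is_valid_construction([1, 2, 3, 5], n=5, min_size=4))
```

A checker of this kind only accepts or rejects the submitted object, so a solution with a persuasive written argument can still fail verification, which is the kind of divergence between proof quality and construction validity that the abstract describes.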

Problem

Research questions and friction points this paper is trying to address.

combinatorics

mathematical reasoning

proof reasoning

constructive realization

Olympiad-level problems

Innovation

Methods, ideas, or system contributions that make the work stand out.

combinatorics benchmark

proof reasoning

constructive realization