Nondeterministic Polynomial-time Problem Challenge: An Ever-Scaling Reasoning Benchmark for LLMs

📅 2025-04-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM reasoning benchmarks suffer from rapid obsolescence (broken within months) and susceptibility to prompt-engineering exploits. Method: We propose NPPC, the first perpetual, scalable benchmark grounded in NP-completeness, introducing the "ever-scaling" paradigm. It comprises three integrated modules: npgym (automated generation of 25 NP-complete problem classes), npsolver (online/offline solving and evaluation), and npeval (multi-dimensional, fine-grained analysis). Built on computational complexity theory, its problem-generation framework resists algorithmic shortcuts and supports both API-based and local model inference. Contribution/Results: Experiments show mainstream LLMs achieve <10% accuracy on hard instances, confirming benchmark robustness; DeepSeek-R1 emerges as the strongest current reasoning model; and we report the first observation that, in higher-tier LLMs, token consumption and the frequency of "aha moments" first increase and then decrease as problem hardness grows.

📝 Abstract
Reasoning is a fundamental capability of large language models (LLMs). Due to the rapid progress of LLMs, current benchmarks face two main issues: i) they can be crushed in a short time (less than a year), and ii) they can be easily hacked. To address these issues, we propose the ever-scaling property for building benchmarks that are uncrushable, unhackable, auto-verifiable, and general. This paper presents the Nondeterministic Polynomial-time Problem Challenge (NPPC), an ever-scaling reasoning benchmark for LLMs. Specifically, NPPC has three main modules: i) npgym, which provides a unified interface to 25 well-known NP-complete problems and can generate any number of instances at any level of complexity; ii) npsolver, which provides a unified interface to evaluate problem instances with both online and offline models, via APIs and local deployments respectively; and iii) npeval, which provides comprehensive, ready-to-use tools to analyze the performance of LLMs across different problems, the number of tokens, aha moments, reasoning errors, and solution errors. Extensive experiments over widely used LLMs demonstrate that: i) NPPC can successfully decrease the performance of advanced LLMs to below 10%, showing that NPPC is uncrushable; ii) DeepSeek-R1, Claude-3.7-Sonnet, and o1/o3-mini are the most powerful LLMs, with DeepSeek-R1 outperforming Claude-3.7-Sonnet and o1/o3-mini on most of the NP-complete problems considered; and iii) the numbers of tokens and aha moments in advanced LLMs, e.g., Claude-3.7-Sonnet and DeepSeek-R1, are observed to first increase and then decrease as the problem instances become more difficult. We believe that NPPC is the first ever-scaling reasoning benchmark, serving as an uncrushable and unhackable testbed for LLMs on the path toward artificial general intelligence (AGI).
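The ever-scaling idea rests on a property of NP-complete problems: instances can be generated at arbitrary size, and any candidate solution can be checked in polynomial time. As a minimal illustration of an npgym-style generate/verify pair (hypothetical code, not the paper's actual API), here is a random 3-SAT instance generator with a polynomial-time verifier; difficulty scales with the number of variables and the clause-to-variable ratio:

```python
import random

def generate_3sat(num_vars, num_clauses, seed=0):
    """Generate a random 3-SAT instance: a list of clauses, each a tuple
    of three signed 1-based variable indices (negative = negated)."""
    rng = random.Random(seed)
    clauses = []
    for _ in range(num_clauses):
        # Pick 3 distinct variables, then negate each with probability 1/2.
        chosen = rng.sample(range(1, num_vars + 1), 3)
        clauses.append(tuple(v if rng.random() < 0.5 else -v for v in chosen))
    return clauses

def verify_3sat(clauses, assignment):
    """Polynomial-time auto-verification: check that `assignment`
    (dict var -> bool) satisfies every clause."""
    def literal_true(lit):
        value = assignment[abs(lit)]
        return value if lit > 0 else not value
    return all(any(literal_true(lit) for lit in clause) for clause in clauses)
```

Finding a satisfying assignment is hard as instances grow, but `verify_3sat` always runs in time linear in the formula size, which is what makes such a benchmark auto-verifiable at any scale.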
Problem

Research questions and friction points this paper is trying to address.

Creating uncrushable, unhackable benchmarks for LLM reasoning evaluation
Addressing rapid obsolescence and vulnerability of current LLM benchmarks
Proposing ever-scaling benchmark with auto-verification for AGI progress
Innovation

Methods, ideas, or system contributions that make the work stand out.

Ever-scaling benchmark with npgym for NP problems
Unified npsolver for online and offline evaluations
Comprehensive npeval for performance analysis tools
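The unified-interface idea behind npsolver can be sketched as follows (a hypothetical simplification, not the paper's implementation): treat any solver, whether an API client or a locally deployed model, as a callable from prompt to answer, and score it with each instance's own polynomial-time checker:

```python
from typing import Callable

# A "solver" is any callable mapping a problem prompt to a candidate
# answer string -- an API-backed model and a local model share this shape.
Solver = Callable[[str], str]

def evaluate(solver: Solver, instances: list) -> float:
    """Run a solver over instances and return its accuracy.
    Each instance supplies its own checker, so grading is automatic."""
    correct = 0
    for inst in instances:
        answer = solver(inst["prompt"])
        if inst["check"](answer):  # polynomial-time auto-verification
            correct += 1
    return correct / len(instances)
```

Because correctness is decided by the instance's checker rather than by reference answers, the same evaluation loop works unchanged for online and offline models.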