ParaThinker: Native Parallel Thinking as a New Paradigm to Scale LLM Test-time Compute

📅 2025-08-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) suffer from a “tunnel vision” bottleneck during test-time compute scaling: serial inference yields only marginal performance gains despite increased computation. Method: The paper proposes ParaThinker, an end-to-end framework that trains a model to generate multiple independent, diverse reasoning paths in parallel and fuse their outputs via a learned summarization step, shifting compute scaling from serial depth to parallel width. Contribution/Results: ParaThinker enables native parallel reasoning at inference time, sidestepping the tunnel-vision problem and improving reasoning robustness. On multiple challenging reasoning benchmarks with 8 parallel paths, it boosts accuracy by 12.3% (1.5B model) and 7.5% (7B model) on average, with only 7.1% latency overhead; notably, the smaller model surpasses larger sequential baselines.

📝 Abstract
Recent advances in Large Language Models (LLMs) have been driven by test-time compute scaling - a strategy that improves reasoning by generating longer, sequential thought processes. While effective, this approach encounters a significant bottleneck as computation increases, where further computation offers only marginal performance gains. We argue this ceiling is not an inherent limit of the model's capability but a flaw in the scaling strategy itself, a phenomenon we term "Tunnel Vision", where a model's imperfect initial steps lock it into a suboptimal reasoning path. To overcome this, we introduce a new scaling paradigm: native thought parallelism. We present ParaThinker, an end-to-end framework that trains an LLM to generate multiple, diverse reasoning paths in parallel and synthesize them into a superior final answer. By exploring different lines of thoughts simultaneously, ParaThinker effectively sidesteps the Tunnel Vision issue and unlocks the model's latent reasoning potential. Our approach demonstrates that scaling compute in parallel (width) is a more effective and efficient way to superior reasoning than simply scaling sequentially (depth). On challenging reasoning benchmarks, ParaThinker achieves substantial accuracy improvements over sequential LLMs (12.3% for 1.5B and 7.5% for 7B models on average with 8 parallel paths), while adding only negligible latency overhead (7.1%). This enables smaller models to surpass much larger counterparts and establishes parallel thinking as a critical, efficient dimension for scaling future LLMs.
Problem

Research questions and friction points this paper is trying to address.

Overcoming Tunnel Vision in sequential LLM reasoning paths
Scaling test-time compute via parallel thought generation
Improving reasoning accuracy with minimal latency overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates multiple diverse reasoning paths in parallel
Synthesizes parallel thoughts into superior final answer
Scales compute widthwise to overcome sequential bottlenecks
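The width-scaling idea above can be illustrated with a toy simulation. This is a minimal sketch, not the paper's method: each "reasoning path" is modeled as an independent sampler that reaches the correct answer with some fixed probability, and the paths are fused by majority vote as a crude stand-in for ParaThinker's learned summarization mechanism. All names and probabilities here are illustrative assumptions.

```python
import random
from collections import Counter

def sample_path(correct_answer: str, p_correct: float, rng: random.Random) -> str:
    """Stand-in for one independently sampled reasoning path.
    With probability p_correct it reaches the right answer; otherwise
    it commits to one of several wrong answers ("tunnel vision")."""
    if rng.random() < p_correct:
        return correct_answer
    return rng.choice(["wrong_a", "wrong_b", "wrong_c"])

def parallel_think(correct_answer: str, width: int, p_correct: float, seed: int = 0) -> str:
    """Sample `width` independent paths and fuse them by majority vote
    (a simple proxy for ParaThinker's learned fusion step)."""
    rng = random.Random(seed)
    answers = [sample_path(correct_answer, p_correct, rng) for _ in range(width)]
    return Counter(answers).most_common(1)[0][0]

def accuracy(width: int, p_correct: float = 0.55, trials: int = 2000) -> float:
    """Fraction of trials where the fused answer is correct."""
    hits = sum(parallel_think("42", width, p_correct, seed=t) == "42"
               for t in range(trials))
    return hits / trials

# Wider parallel sampling lifts accuracy well above a single path's ~55%,
# illustrating why scaling width can beat scaling serial depth.
print(f"width=1: {accuracy(1):.2f}  width=8: {accuracy(8):.2f}")
```

Because the paths run independently, wall-clock latency in this regime grows far slower than total compute, which is the efficiency argument behind the paper's reported 7.1% latency overhead.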
Hao Wen
Institute for AI Industry Research (AIR), Tsinghua University
Yifan Su
Institute for AI Industry Research (AIR), Tsinghua University
Feifei Zhang
Institute for AI Industry Research (AIR), Tsinghua University
Yunxin Liu
IEEE Fellow, Guoqiang Professor, Institute for AI Industry Research (AIR), Tsinghua University
Mobile Computing, Edge Computing, AIoT, System, Networking
Yunhao Liu
ACM Fellow, IEEE Fellow, CCF Fellow, Tsinghua University
Wireless Sensor Networks/RFID, Cyber Physical Systems and IoT, Privacy and Security, Cloud Computing
Ya-Qin Zhang
Institute for AI Industry Research (AIR), Tsinghua University
Yuanchun Li
Institute for AI Industry Research (AIR), Tsinghua University
Mobile Computing, Artificial Intelligence