🤖 AI Summary
This work addresses "harmful overthinking" in large reasoning models, where continued reasoning after the model has already generated a correct answer degrades performance, a behavior that prior research has not examined closely. The authors propose a prefix-level trajectory evaluation protocol grounded in reasoning sufficiency, which explicitly distinguishes benign redundancy from harmful overthinking. Applied to both multimodal and language-only reasoning benchmarks, the protocol shows that halting inference at the first correct prefix improves accuracy over standard reasoning by up to 21%. Notably, common efficiency-oriented strategies such as early stopping reduce redundant reasoning but fail to mitigate harmful overthinking. These findings indicate that model performance is constrained not only by reasoning capability but also by the ability to stop reasoning at the right time.
📝 Abstract
Large Reasoning Models (LRMs) improve performance by generating explicit intermediate reasoning traces through increased test-time compute, yet the assumption that longer reasoning is consistently beneficial remains under-examined. While recent evidence shows that additional reasoning can lead models to overthink, we ask: "Once a model has reached the correct answer, does further reasoning refine the solution, or deviate from it?" To study the dynamics after correctness, we introduce a prefix-level trajectory evaluation protocol grounded in reasoning sufficiency, defined as the minimum reasoning budget required for a model to first generate the correct answer. This allows us to disentangle verbose overthinking, where additional reasoning is redundant but harmless, from harmful overthinking, where continued reasoning destabilizes an already-correct trajectory. Starting from multimodal benchmarks, we find that many instances considered reasoning-intensive require surprisingly little reasoning. Moreover, stopping at the first correct prefix improves accuracy over standard reasoning by up to 21%, revealing that current models are limited not only by their ability to reason, but also by their inability to stop at the right time. Furthermore, while common efficiency strategies like early stopping substantially reduce verbose overthinking (by up to 50%), they fail to mitigate harmful overthinking. Failure analysis reveals that correctness deviations are mainly driven by logical drift and visual reinterpretation. Finally, we show that our findings generalize to language-only reasoning benchmarks, highlighting harmful overthinking as a broader reliability risk. Code is available at https://simonecaldarella.github.io/thinking-past-the-answer.
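To make the prefix-level idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a reasoning trace can be split into steps and that some procedure can elicit an answer from any truncated prefix. The names `first_correct_prefix`, `classify_trajectory`, `lookup_answers`, and the `answer_at` callback are hypothetical, and deciding "harmful" from the final answer alone is a simplifying assumption.

```python
from typing import Callable, Dict, Optional, Sequence

# Hypothetical callback: maps a reasoning prefix to the answer the model
# would give if forced to stop there. In a real setup this would call the
# model; here it is only an interface assumption.
AnswerFn = Callable[[Sequence[str]], str]


def first_correct_prefix(
    steps: Sequence[str], answer_at: AnswerFn, gold: str
) -> Optional[int]:
    """Return the smallest prefix length whose elicited answer equals gold.

    This plays the role of "reasoning sufficiency" measured in steps.
    Returns None if no prefix (including the full trace) is correct.
    """
    for k in range(len(steps) + 1):
        if answer_at(steps[:k]) == gold:
            return k
    return None


def classify_trajectory(
    steps: Sequence[str], answer_at: AnswerFn, gold: str
) -> str:
    """Label a trace using the first correct prefix and the final answer."""
    k = first_correct_prefix(steps, answer_at, gold)
    if k is None:
        return "never_correct"
    if answer_at(steps) != gold:
        # Correct at some prefix, wrong at the end (assumed criterion).
        return "harmful_overthinking"
    if k < len(steps):
        # Correct early and still correct at the end: redundant but harmless.
        return "verbose_overthinking"
    return "sufficient"


def lookup_answers(answers_by_length: Dict[int, str]) -> AnswerFn:
    """Toy stand-in for a model: answers are looked up by prefix length."""
    return lambda prefix: answers_by_length[len(prefix)]


if __name__ == "__main__":
    steps = ["step 1", "step 2", "step 3", "step 4"]
    drifting = lookup_answers({0: "?", 1: "7", 2: "12", 3: "12", 4: "15"})
    print(first_correct_prefix(steps, drifting, "12"))
    print(classify_trajectory(steps, drifting, "12"))
```

In this toy trace the answer is first correct after two steps but the full trace ends on a different answer, so stopping at the first correct prefix would recover an instance that standard reasoning gets wrong.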