🤖 AI Summary
Large language models (LLMs) exhibit “overthinking” in chain-of-thought (CoT) reasoning—repeatedly verifying correct answers due to self-doubt induced by excessive reliance on input prompts and internal uncertainty. This work is the first to formally attribute overthinking to these dual causes and proposes a lightweight, fine-tuning-free prompting paradigm grounded in problem credibility assessment. Our method employs multi-step prompt engineering: (1) problem validity detection to filter ill-posed or ambiguous queries, followed by (2) conditional concise response generation that suppresses redundant verification steps. Evaluated across three mathematical reasoning benchmarks and four missing-premise datasets, our approach consistently reduces answer length and reasoning steps while improving accuracy across four state-of-the-art reasoning LLMs (e.g., Llama-3-70B-Instruct, Qwen2-72B-Instruct). The results demonstrate robust generalization and offer a principled, efficient pathway toward more trustworthy and computationally economical reasoning.
📝 Abstract
Reasoning Large Language Models (RLLMs) have demonstrated impressive performance on complex tasks, largely due to the adoption of Long Chain-of-Thought (Long CoT) reasoning. However, they often exhibit overthinking -- performing unnecessary reasoning steps even after arriving at the correct answer. Prior work has largely focused on qualitative analyses of overthinking through sample-based observations of long CoTs. In contrast, we present a quantitative analysis of overthinking from the perspective of self-doubt, characterized by excessive token usage devoted to re-verifying already-correct answers. We find that self-doubt contributes substantially to overthinking. In response, we introduce a simple and effective prompting method that reduces the model's over-reliance on the input question, thereby avoiding self-doubt. Specifically, we first prompt the model to question the validity of the input question, and then to respond concisely based on the outcome of that evaluation. Experiments on three mathematical reasoning tasks and four datasets with missing premises show that our method substantially reduces answer length and yields significant improvements on nearly all datasets across four widely-used RLLMs. Further analysis shows that our method effectively reduces the number of reasoning steps and mitigates self-doubt.
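The two-step idea in the abstract -- first judge whether the question is valid, then answer concisely based on that judgment -- can be sketched as a small prompt-construction helper. This is a minimal illustration, not the paper's verbatim prompts: the exact wording and the `ask_model` callable are assumptions standing in for a real RLLM API.

```python
# Hedged sketch of the validity-gated prompting method described above.
# The prompt text is illustrative; the paper's actual phrasing may differ.

def build_prompt(question: str) -> str:
    """Compose one prompt: check validity first, then answer concisely."""
    return (
        "First, evaluate whether the following question is valid and "
        "contains all premises needed to answer it.\n"
        "If it is invalid or missing a premise, reply only: "
        "'The question is invalid.'\n"
        "Otherwise, answer it concisely without re-verifying a result "
        "you have already confirmed.\n\n"
        f"Question: {question}"
    )

def answer(question: str, ask_model) -> str:
    """Route a question through the validity-gated prompt.

    `ask_model` is any callable mapping a prompt string to a model reply;
    it is a placeholder for a real reasoning-model API call.
    """
    return ask_model(build_prompt(question))

# Usage with a stub model that refuses an ill-posed (missing-premise) question:
def stub_model(prompt: str) -> str:
    return ("The question is invalid." if "Bob" in prompt else "4")

print(answer("Alice has 3 apples and buys 1 more. How many now?", stub_model))  # → 4
print(answer("How many apples does Bob have?", stub_model))  # → The question is invalid.
```

The design point is that the validity check and the concise answer are issued in a single prompt, so no fine-tuning or second inference pass is required.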