🤖 AI Summary
This work identifies a “first-step bias amplification” phenomenon in large language models (LLMs): minor errors in the initial reasoning step of chain-of-thought (CoT) significantly degrade subsequent reasoning quality, and current models exhibit limited self-correction capability. To address this, we first systematically quantify the effect by constructing a novel benchmark with controllable first-step defects. We then propose a reward-model-guided efficient sampling strategy to precisely select high-quality initial steps. Finally, we conduct an in-depth analysis of the reasoning robustness mechanisms in DeepSeek-R1 and Qwen3. Experiments demonstrate that our approach reduces inference cost by up to 70% while preserving reasoning accuracy, offering an interpretable and reproducible optimization paradigm for enhancing early-stage reasoning stability in LLMs.
📝 Abstract
Recent advancements in large language models (LLMs) have significantly advanced complex reasoning capabilities, particularly through extended chain-of-thought (CoT) reasoning that incorporates mechanisms such as backtracking, self-reflection, and self-correction. Despite these developments, the self-correction abilities of LLMs during long CoT reasoning remain underexplored, and recent findings on overthinking suggest that such models often engage in unnecessarily redundant reasoning. In this work, we empirically show that the first reasoning step exerts a disproportionately large influence on the final prediction: errors introduced at this stage can substantially degrade subsequent reasoning quality. This phenomenon is consistently observed across two state-of-the-art open-source reasoning model families: DeepSeek-R1 and Qwen3. To address this, we propose an efficient sampling strategy that leverages a reward model to identify and retain high-quality first reasoning steps while discarding suboptimal ones, achieving up to a 70% reduction in inference cost without sacrificing accuracy. Finally, we introduce a new benchmark specifically constructed with deliberately flawed first reasoning steps to systematically evaluate model self-correction capabilities, offering a foundation for future research on robust reasoning in LLMs.