🤖 AI Summary
This work addresses the inefficiency of large language models in multi-hop question answering, where chain-of-thought reasoning often leads to "overthinking": unnecessarily long reasoning traces for simple sub-questions and avoidable inference cost. The authors propose AdaptR1, a fully reinforcement learning-based framework for adaptive interleaved thinking that decides at each step whether to reason explicitly, rather than making a single query-level decision as existing adaptive methods typically do. The method requires no supervised fine-tuning for cold-start initialization and uses a quality-gated efficiency reward to allocate reasoning budgets step by step. Under the Graph-R1 setting, it reduces average think tokens by 69.71%, with a 90.35% reduction on HotpotQA, while maintaining performance comparable to or better than standard baselines. The analysis further shows that overthinking is concentrated in the initial planning stages rather than spread uniformly across steps.
📝 Abstract
Large Language Models (LLMs) have achieved remarkable performance in complex reasoning tasks through Chain-of-Thought (CoT) prompting. However, this approach often leads to "overthinking," where models generate unnecessarily long reasoning traces for simple queries and incur avoidable inference cost. While recent work has explored adaptive reasoning, existing methods typically make a single query-level decision about whether to reason. This overlooks the dynamic nature of multi-step tasks, where the need for explicit reasoning varies across intermediate stages. To address this limitation, we introduce AdaptR1, a Reinforcement Learning (RL) based framework for adaptive interleaved thinking in multi-hop Question Answering (QA). Unlike previous approaches that require Supervised Fine-Tuning (SFT) for cold-start initialization, AdaptR1 uses a fully RL-based strategy with a quality-gated efficiency reward to dynamically allocate reasoning budgets at each step. Under the Graph-R1 setting, AdaptR1 reduces average think tokens by 69.71%, with a 90.35% reduction on HotpotQA, while maintaining performance comparable to or better than standard baselines. Furthermore, our analysis reveals that overthinking in multi-hop reasoning is not uniformly distributed but occurs predominantly during the initial planning stages, highlighting the effectiveness of step-wise adaptive budget allocation.
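The abstract does not give the form of the quality-gated efficiency reward, so the following is only an illustrative sketch of the general idea: an efficiency bonus for spending fewer think tokens is granted only when answer quality clears a gate. The names `Step` and `quality_gated_reward`, the threshold, the per-step budget, and the weighting are all assumptions made for illustration, not AdaptR1's actual formulation.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Step:
    """One hop of an interleaved trajectory (hypothetical structure)."""
    think_tokens: int  # tokens spent on explicit reasoning; 0 means thinking was skipped


def quality_gated_reward(
    answer_score: float,
    steps: Sequence[Step],
    quality_threshold: float = 0.5,   # assumed gate, not from the paper
    budget_per_step: int = 256,       # assumed per-step think budget
    efficiency_weight: float = 0.2,   # assumed weight of the efficiency bonus
) -> float:
    """Return the answer score plus an efficiency bonus that is gated on quality."""
    # Gate: a low-quality answer earns no credit for being short, so the
    # policy cannot raise its reward simply by skipping reasoning.
    if answer_score < quality_threshold or not steps:
        return answer_score

    # Per-step savings in [0, 1]: 1.0 when thinking is skipped entirely,
    # 0.0 when the step uses its whole budget or more.
    savings = [max(0.0, 1.0 - s.think_tokens / budget_per_step) for s in steps]
    efficiency = sum(savings) / len(savings)
    return answer_score + efficiency_weight * efficiency


if __name__ == "__main__":
    concise = [Step(think_tokens=128), Step(think_tokens=0)]
    verbose = [Step(think_tokens=512), Step(think_tokens=512)]
    print(quality_gated_reward(1.0, concise))  # correct and concise: bonus applied
    print(quality_gated_reward(1.0, verbose))  # correct but verbose: no bonus
    print(quality_gated_reward(0.0, concise))  # wrong answer: gate blocks the bonus
```

Because the bonus is averaged over steps, a trajectory can think at length in one step and skip thinking in another, which matches the step-wise allocation the abstract describes more closely than a single query-level switch would.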