SIGHT: Reinforcement Learning with Self-Evidence and Information-Gain Diverse Branching for Search Agent

📅 2026-02-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses redundant retrieval results and low signal-to-noise ratios in multi-turn search. These problems can trap agents in “tunnel vision” and lead to irreversible error accumulation. To mitigate them, the authors propose a robust exploration framework that needs no external verifiers. It integrates Self-Evidence Support (SES), dynamic prompt interventions guided by an information-gain score, a diverse branching mechanism, and Group Relative Policy Optimization (GRPO). Because prompts are adjusted according to the quantified information gain, the framework increases reasoning diversity and suppresses redundancy. On both single-hop and multi-hop question answering benchmarks, the method significantly outperforms existing approaches and reaches higher accuracy with fewer search steps, particularly on complex reasoning tasks.
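The summary does not say how the information-gain score is computed or how it maps to an intervention. The sketch below is one plausible reading and every part of it is an assumption. Information gain is taken as the drop in Shannon entropy of the agent's sampled answers before and after it reads an observation. Redundancy is detected with a crude word-level Jaccard check. The function names (`answer_entropy`, `information_gain`, `choose_intervention`) and all thresholds are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def answer_entropy(answers):
    """Shannon entropy (in nats) of the empirical distribution over sampled answers."""
    if not answers:
        return 0.0
    total = len(answers)
    counts = Counter(answers)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def information_gain(answers_before, answers_after):
    """Hypothetical IG score: drop in answer entropy after reading one observation."""
    return answer_entropy(answers_before) - answer_entropy(answers_after)


def token_jaccard(a, b):
    """Word-level Jaccard similarity, used here as a crude redundancy check."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def choose_intervention(ig, new_evidence, past_evidence,
                        branch_at=0.5, reflect_at=0.05, dup_at=0.8):
    """Hypothetical rule mapping redundancy and IG to a prompt intervention.

    The thresholds are illustrative defaults, not values from the paper.
    """
    # Near-duplicate evidence: ask the agent to de-duplicate and search differently.
    if any(token_jaccard(new_evidence, old) >= dup_at for old in past_evidence):
        return "deduplicate"
    # Observation sharply reduced uncertainty: treat it as a pivotal state and branch.
    if ig >= branch_at:
        return "branch"
    # Observation barely changed the answer distribution: prompt reflection.
    if ig <= reflect_at:
        return "reflect"
    return "continue"


# Example: four sampled answers before and after one retrieval step.
before = ["Paris", "Lyon", "Paris", "Nice"]
after = ["Paris", "Paris", "Paris", "Paris"]
ig = information_gain(before, after)  # 1.5 * ln 2, about 1.04 nats
print(choose_intervention(ig, "Paris is the capital of France", []))
```

Checking redundancy before information gain is a design choice in this sketch. It keeps a duplicated passage from being mistaken for a pivotal observation.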

📝 Abstract
Reinforcement Learning (RL) has empowered Large Language Models (LLMs) to master autonomous search for complex question answering. However, particularly within multi-turn search scenarios, this interaction introduces a critical challenge: search results often suffer from high redundancy and low signal-to-noise ratios. Consequently, agents easily fall into "Tunnel Vision," where the forced interpretation of early noisy retrievals leads to irreversible error accumulation. To address these challenges, we propose SIGHT, a framework that enhances search-based reasoning through Self-Evidence Support (SES) and Information-Gain Driven Diverse Branching. SIGHT distills search results into high-fidelity evidence via SES and calculates an Information Gain score to pinpoint pivotal states where observations maximally reduce uncertainty. This score guides Dynamic Prompting Interventions, including de-duplication, reflection, or adaptive branching, to spawn new branches with SES. Finally, by integrating SES and correctness rewards via Group Relative Policy Optimization, SIGHT internalizes robust exploration strategies without external verifiers. Experiments on single-hop and multi-hop QA benchmarks demonstrate that SIGHT significantly outperforms existing approaches, particularly in complex reasoning scenarios, using fewer search steps.
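The last step of the abstract combines SES and correctness rewards under GRPO. The sketch below shows the group-relative part as GRPO is commonly described: several rollouts are sampled for the same question, and each rollout's scalar reward is standardized by the mean and standard deviation of its group. How SIGHT actually weights SES against correctness is not stated here. The additive form `combined_reward`, the weight `ses_weight=0.5`, and the assumption that the SES score lies in [0, 1] are all hypothetical.

```python
import statistics


def combined_reward(correct, ses_score, ses_weight=0.5):
    """Hypothetical scalar reward: correctness (0/1) plus a weighted SES score in [0, 1]."""
    return float(correct) + ses_weight * ses_score


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize each reward within its rollout group.

    Uses the population standard deviation; implementations differ on this detail.
    eps avoids division by zero when every rollout gets the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One group of four rollouts for the same question: (answer correct?, SES score).
group = [(True, 0.9), (True, 0.4), (False, 0.6), (False, 0.0)]
rewards = [combined_reward(c, s) for c, s in group]
advantages = group_relative_advantages(rewards)
print([round(a, 2) for a in advantages])
```

Because the advantages are relative to the group, a correct rollout with better-supported evidence is pushed up more than a correct rollout with weak evidence. This ranking comes from the reward itself, so no external verifier is needed.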
Problem

Research questions and friction points this paper is trying to address.

redundancy
signal-to-noise ratio
Tunnel Vision
multi-turn search
error accumulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Evidence Support
Information-Gain Driven Branching
Dynamic Prompting Intervention
Group Relative Policy Optimization
Search-Based Reasoning
Authors
Wenlin Zhong, Zhejiang University
Jinluan Yang, Zhejiang University
Yiquan Wu, Zhejiang University
Yi Liu, Chongqing Ant Consumer Finance Co., Ltd.
Jianhang Yao, Alibaba Group
Kun Kuang, Zhejiang University
Causal Inference, Data Mining, Machine Learning