π€ AI Summary
This study addresses the inefficiency in long-horizon search agents caused by accumulating excessive retrieved content across multiple tool-use rounds. The authors systematically investigate the impact of masking outdated observations on search performance and uncover, for the first time, an asymmetric inverted U-shaped relationship: performance gains peak when medium-scale language models (4Bβ284B parameters) are paired with high-recall retrievers, while stronger models suffer performance degradation due to the loss of critical evidence. Through comprehensive offline and online benchmarks, trajectory analysis, and attention mechanism modeling, the work demonstrates that context management should be treated as a dynamic intervention strategy, contingent upon the interplay between model capability and retriever effectiveness. The authors release an open-source experimental framework and associated trajectory data.
π Abstract
Long-horizon search agents accumulate large amounts of retrieved content across many tool calls, making context-budget efficiency increasingly important. A minimal intervention is to mask stale observations from the context as the trajectory progresses, but it remains unclear when this form of context management helps and why. We study observation masking through a systematic sweep over various agent backbones (4B to 284B parameters) and three retrievers on offline and live-web agentic search benchmarks. We find that the accuracy gain from masking follows an asymmetric inverted-U shape when plotted against the model's accuracy without context management: a plateau under weak retrievers, a peak when a strong retriever meets a mid-capacity model, and a sharp collapse when the model is saturated. This pattern reflects the interaction between retriever recall and the model's implicit filtering capacity, rather than either factor in isolation. Mechanistically, masking implements a token-for-turn trade-off: it removes observations the model has largely stopped attending to and pages the agent rarely re-opens. The added turns help when they convert failures into successes, but they fail when masking removes evidence the model would otherwise have used. We therefore reframe context management as a regime-dependent intervention and provide a holistic perspective for analyzing context use in agentic deep search. We release our scaffold and trajectories here (https://github.com/i-DeepSearch/observation-masking) to support future research.