🤖 AI Summary
Existing benchmarks for search agents rely on manually curated tasks, which limits the size and structural complexity of the search space they can cover. This work proposes an automated pipeline that uses a large-scale knowledge graph covering over 7 million Wikipedia entities to generate a new benchmark, LoHoSearch, of 544 human-verified, long-horizon search questions spanning 11 domains. The questions feature high structural complexity, large search spaces, and unique answers. The benchmark substantially raises the bar for evaluation: even the strongest model achieves only 34.74% accuracy, far below the over-90% accuracy reported on prior benchmarks, and existing context management strategies yield limited gains (at most +6.8%). These results underscore how challenging the benchmark is for long-horizon reasoning and context management.
📝 Abstract
Search agent benchmarks exemplified by BrowseComp have rapidly saturated over the past year, with the strongest models surpassing 90% accuracy. Because these benchmarks are predominantly human-authored, their annotators lack a global view of entity statistics and cannot systematically maximize search space size and structural complexity. This creates a difficulty ceiling that is hard to break. To address this, we introduce LoHoSearch (Long-Horizon Search Agents), a challenging benchmark comprising 544 human-verified questions across 11 domains. LoHoSearch is constructed via an automated pipeline built upon a knowledge graph covering over 7 million Wikipedia entities, which selects relations with large search spaces and assembles them into structurally complex questions with KG-verified unique answers. Our evaluation demonstrates that even the strongest model achieves only 34.74% accuracy, and that existing context management strategies (best +6.8%) yield far smaller gains than on prior benchmarks. LoHoSearch provides a more demanding standard for evaluating long-horizon reasoning and context management in search agents.
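As a rough illustration of the two checks the abstract mentions (measuring how large a relation's search space is, and verifying against the knowledge graph that a combined question has a unique answer), here is a minimal Python sketch over a toy in-memory triple list. The entity names, the helper functions, and the (relation, object) constraint format are all assumptions made for illustration; this is not the LoHoSearch pipeline itself, whose details are not given here.

```python
from collections import defaultdict

# Toy triples (subject, relation, object). All entities are made up for illustration.
TRIPLES = [
    ("alice", "born_in", "paris"),
    ("bob", "born_in", "paris"),
    ("carol", "born_in", "paris"),
    ("alice", "occupation", "physicist"),
    ("bob", "occupation", "painter"),
    ("carol", "occupation", "physicist"),
    ("alice", "award", "prize_x"),
]


def build_index(triples):
    """Map each (relation, object) pair to the set of subjects that satisfy it."""
    index = defaultdict(set)
    for subj, rel, obj in triples:
        index[(rel, obj)].add(subj)
    return index


def search_space_size(index, constraint):
    """Number of candidate entities an agent would have to sift through for one constraint."""
    return len(index.get(constraint, set()))


def answer_set(index, constraints):
    """Entities that satisfy every constraint (intersection of the candidate sets)."""
    candidate_sets = [index.get(c, set()) for c in constraints]
    return set.intersection(*candidate_sets) if candidate_sets else set()


def has_unique_answer(index, constraints):
    """A question is kept only if exactly one entity satisfies all of its constraints."""
    return len(answer_set(index, constraints)) == 1


index = build_index(TRIPLES)
question = [("born_in", "paris"), ("occupation", "physicist"), ("award", "prize_x")]
print(search_space_size(index, ("born_in", "paris")))
print(answer_set(index, question))
print(has_unique_answer(index, question))
```

In this toy setting, each individual constraint matches several entities, while the conjunction narrows to a single one; that combination of large per-constraint search spaces and a unique final answer is the property the abstract describes.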