🤖 AI Summary
In long-horizon tasks, LLM-based agents suffer from degraded coherence and accuracy due to error accumulation, hallucination, and context overload, stemming primarily from inadequate dynamic context management and insufficient coordination across multi-step reasoning. To address this, we propose a hierarchical three-module collaborative architecture: a main agent for tactical execution, a meta-thinker for strategic oversight and reflective intervention, and a context manager that maintains high-information-density state via dynamic summarization and lightweight scheduling. This design enables test-time scaling and facilitates efficient post-training optimization for smaller models. Evaluated on GAIA, BrowseComp, and Humanity's Last Exam, our approach achieves accuracy gains of up to 20% over single- and multi-agent baselines, matches the performance of established DeepResearch agents, and improves both reasoning efficiency and long-range coherence.
📝 Abstract
Long-horizon tasks that require sustained reasoning and multiple tool interactions remain challenging for LLM agents: small errors compound across steps, and even state-of-the-art models often hallucinate or lose coherence. We identify context management as the central bottleneck -- extended histories cause agents to overlook critical evidence or become distracted by irrelevant information, thus failing to replan or reflect on previous mistakes. To address this, we propose COMPASS (Context-Organized Multi-Agent Planning and Strategy System), a lightweight hierarchical framework that separates tactical execution, strategic oversight, and context organization into three specialized components: (1) a Main Agent that performs reasoning and tool use, (2) a Meta-Thinker that monitors progress and issues strategic interventions, and (3) a Context Manager that maintains concise, relevant progress briefs for different reasoning stages. Across three challenging benchmarks -- GAIA, BrowseComp, and Humanity's Last Exam -- COMPASS improves accuracy by up to 20% relative to both single- and multi-agent baselines. We further introduce a test-time scaling extension that elevates performance to match established DeepResearch agents, and a post-training pipeline that delegates context management to smaller models for enhanced efficiency.
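The three-component separation described in the abstract can be sketched as a simple control loop. Everything below is an illustrative assumption, not the paper's implementation: the class names, the fixed-window summarization stand-in, and the toy "repeated step" intervention rule are all hypothetical.

```python
class ContextManager:
    """Maintains a concise progress brief instead of the full step history.

    Stand-in for dynamic summarization: older entries beyond a fixed window
    are folded into a rolling summary string (hypothetical design, not the
    paper's method).
    """
    def __init__(self, window=3):
        self.window = window
        self.brief = []    # recent, high-signal steps
        self.summary = ""  # compressed record of older steps

    def update(self, result):
        self.brief.append(result)
        if len(self.brief) > self.window:
            oldest = self.brief.pop(0)
            self.summary = (self.summary + " " + oldest).strip()

    def view(self):
        header = [f"[summary] {self.summary}"] if self.summary else []
        return header + self.brief


class MetaThinker:
    """Strategic overseer: issues an intervention when progress stalls.

    Toy rule: flag the agent if its last two steps were identical.
    """
    def review(self, brief):
        if len(brief) >= 2 and brief[-1] == brief[-2]:
            return "replan: last two steps were identical"
        return None


class MainAgent:
    """Tactical executor: here a stub that echoes the current subgoal."""
    def act(self, subgoal, context_view, intervention):
        if intervention:
            return f"revised({subgoal})"
        return f"did({subgoal})"


def run(subgoals):
    ctx, meta, agent = ContextManager(), MetaThinker(), MainAgent()
    for goal in subgoals:
        intervention = meta.review(ctx.brief)        # strategic oversight
        result = agent.act(goal, ctx.view(), intervention)  # tactical step
        ctx.update(result)                           # context organization
    return ctx.view()
```

The point of the sketch is the division of labor: the Main Agent never sees raw history, only the Context Manager's brief, while the Meta-Thinker reads the same brief to decide whether to intervene. For example, `run(["search", "search", "read", "answer"])` triggers one intervention (the repeated `search` step) and folds the oldest result into the summary once the window overflows.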