🤖 AI Summary
Current interactive large language model (LLM) agents typically rely on goal-conditioned, step-by-step planning, so they perceive the environment passively and waste steps on trial and error. This work proposes the Map-then-Act Paradigm (MAP), which brings cognitive map theory to interactive agents. MAP has the agent reason proactively through three stages: global exploration, task-specific map construction, and knowledge-augmented execution. This decouples environment understanding from immediate action. The authors build a plug-and-play MAP framework and release MAP-2K, a dataset of map-then-act trajectories. Across benchmarks and LLMs, MAP yields consistent gains: on ARC-AGI-3, frontier models with MAP surpass near-zero baseline performance in 22 of 25 game environments, and training on MAP-2K outperforms training on expert execution traces, suggesting that understanding the environment matters more than imitating behavior.
📝 Abstract
Current interactive LLM agents rely on goal-conditioned stepwise planning, where environmental understanding is acquired reactively during execution rather than established beforehand. This temporal inversion leads to Delayed Environmental Perception: agents must infer environmental constraints through trial and error, resulting in an Epistemic Bottleneck that traps them in inefficient failure cycles. Inspired by human affordance perception and cognitive map theory, we propose the Map-then-Act Paradigm (MAP), a plug-and-play framework that moves environment understanding ahead of execution. MAP consists of three stages: (1) Global Exploration, which acquires environment-general priors; (2) Task-Specific Mapping, which constructs a structured cognitive map; and (3) Knowledge-Augmented Execution, which solves tasks grounded in the map. Experiments show consistent gains across benchmarks and LLMs. On ARC-AGI-3, MAP enables frontier models to surpass near-zero baseline performance in 22 of 25 game environments. We further introduce MAP-2K, a dataset of map-then-act trajectories, and show that training on it outperforms training on expert execution traces, suggesting that understanding environments is more fundamental than imitation.
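To make the three-stage ordering concrete, here is a minimal toy sketch in Python. It is not the authors' implementation: the real MAP agents are LLMs acting in rich environments, whereas this example swaps in a tiny deterministic room graph and breadth-first search. Every name here (`ToyEnv`, `TRANSITIONS`, and the three stage functions) is hypothetical and chosen only for illustration.

```python
from collections import deque

# Hypothetical toy world: deterministic moves between named rooms.
TRANSITIONS = {
    "hall":    {"north": "library", "east": "kitchen"},
    "library": {"south": "hall", "east": "vault"},
    "kitchen": {"west": "hall"},
    "vault":   {"west": "library"},
}


class ToyEnv:
    """Stand-in environment (an assumption, not an API from the paper)."""

    def __init__(self, start="hall"):
        self.start = start
        self.state = start

    def reset(self, state=None):
        # Resetting to an arbitrary state is a simplification for the toy.
        self.state = state or self.start
        return self.state

    def actions(self):
        return sorted(TRANSITIONS[self.state])

    def step(self, action):
        self.state = TRANSITIONS[self.state][action]
        return self.state


def global_exploration(env):
    """Stage 1: try every action in every reachable state, with no goal in mind."""
    priors, frontier = {}, deque([env.reset()])
    while frontier:
        state = frontier.popleft()
        if state in priors:
            continue
        priors[state] = {}
        env.reset(state)
        for action in env.actions():
            env.reset(state)
            nxt = env.step(action)
            priors[state][action] = nxt
            frontier.append(nxt)
    return priors


def task_specific_mapping(priors, goal):
    """Stage 2: turn the priors into a goal-specific map (state -> action)."""
    task_map, seen, frontier = {}, {goal}, deque([goal])
    while frontier:
        target = frontier.popleft()
        for state, moves in priors.items():
            for action, nxt in moves.items():
                if nxt == target and state not in seen:
                    seen.add(state)
                    task_map[state] = action
                    frontier.append(state)
    return task_map


def knowledge_augmented_execution(env, task_map, goal, max_steps=20):
    """Stage 3: act by consulting the map instead of probing the environment."""
    state, trace = env.reset(), []
    while state != goal and len(trace) < max_steps:
        if state not in task_map:
            break  # the map has no route from here
        action = task_map[state]
        trace.append(action)
        state = env.step(action)
    return state == goal, trace


env = ToyEnv()
priors = global_exploration(env)
task_map = task_specific_mapping(priors, goal="vault")
success, trace = knowledge_augmented_execution(env, task_map, goal="vault")
```

In this sketch, all probing of the environment happens in stage 1, before any goal is known. Stage 3 then reaches the goal without a single exploratory step, which is the separation of understanding from acting that the abstract describes.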