AI Summary
This work addresses a problem large language models face in long-horizon, multi-turn agent tasks: the context grows with every turn, which makes the global task state harder to track and introduces long-context interference that impairs reasoning and decision-making. The authors propose an end-to-end approach that requires neither expert trajectories nor auxiliary models. The method uses hierarchical planning to decompose a task into explicit subgoals and an information folding mechanism to compress the history of completed subgoals, thereby reducing interference. It also introduces hierarchical reflection and a subgoal-oriented process reward scheme to stabilize subgoal generation, transition, and execution. The approach is evaluated on three public agent benchmarks, where the authors report that it is effective on long-horizon tasks.
Abstract
While Large Language Models (LLMs) have demonstrated strong capabilities as autonomous agents across a wide range of tasks, their performance often degrades in multi-turn, long-horizon agentic tasks. Existing methods have made progress in two ways: fine-grained credit assignment, which alleviates sparse rewards over long horizons, and hierarchical reinforcement learning, which decomposes tasks and reduces long-term dependencies. However, these methods still do not directly address long-context interference, in which continuously growing histories weaken the agent's ability to track the global task state and impair subsequent reasoning and decision-making. Inspired by the way humans handle complex tasks by decomposing them into subgoals and summarizing completed progress, we propose Hierarchical Planning and Information Folding (HIPIF) for long-horizon LLM agent learning. HIPIF trains the agent end-to-end to organize long-horizon execution around explicit subgoals while folding the histories of completed subgoals to reduce long-context interference. Furthermore, to stabilize subgoal-based planning and execution, HIPIF combines hierarchical reflection with subgoal-oriented process rewards to guide subgoal generation, transition, and execution, without relying on costly auxiliary models or task-specific expert trajectories. Extensive experiments on three publicly available agentic benchmarks demonstrate the effectiveness of our method.
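To make the folding idea concrete, the following is a minimal illustrative sketch, not the HIPIF implementation. It assumes that "folding" means replacing the raw turns of a completed subgoal with a short summary when the agent's context is rebuilt. The names `Subgoal`, `fold`, and `build_context` are hypothetical, and the summary is passed in by hand, whereas the abstract describes an agent trained end-to-end to produce such content itself.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Subgoal:
    """One explicit subgoal and the interaction turns spent on it (hypothetical structure)."""
    description: str
    turns: List[str] = field(default_factory=list)  # raw action/observation turns
    summary: Optional[str] = None                   # set once the subgoal is folded

    @property
    def folded(self) -> bool:
        return self.summary is not None


def fold(subgoal: Subgoal, summary: str) -> None:
    """Replace a completed subgoal's raw turns with a short summary."""
    subgoal.summary = summary
    subgoal.turns.clear()


def build_context(task: str, subgoals: List[Subgoal]) -> str:
    """Rebuild the context: folded subgoals take one line, active ones keep full history."""
    lines = [f"Task: {task}"]
    for i, sg in enumerate(subgoals, start=1):
        if sg.folded:
            lines.append(f"[done] Subgoal {i}: {sg.description} -> {sg.summary}")
        else:
            lines.append(f"[active] Subgoal {i}: {sg.description}")
            lines.extend(f"  {turn}" for turn in sg.turns)
    return "\n".join(lines)


# Toy usage: the task, turns, and summary below are invented for illustration only.
first = Subgoal("find the mug", turns=["go to kitchen", "obs: mug on counter", "take mug"])
second = Subgoal("heat the mug", turns=["go to microwave"])

before = build_context("heat a mug", [first, second])
fold(first, "mug is in inventory")
after = build_context("heat a mug", [first, second])
```

In this toy setup the context after folding keeps the task, a one-line record of the completed subgoal, and the full history of the active subgoal only, so the context scales with the number of subgoals rather than the total number of turns.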