🤖 AI Summary
Problem: Existing embodied agents exhibit weak compositional generalization, limited robustness under complex perturbations, and a pronounced gap between planning and execution when performing long-horizon, multi-step manipulation tasks. Method: We propose RoboHiMan, a hierarchical evaluation paradigm comprising: (1) HiMan-Bench, a benchmark designed for compositional generalization that covers both atomic and compositional manipulation tasks under diverse perturbations; (2) three evaluation paradigms (vanilla, decoupled, and coupled) that probe the necessity of skill composition and reveal bottlenecks in hierarchical architectures; and (3) a multi-level training dataset for analyzing progressive data scaling. Results: Experiments reveal clear capability gaps across representative end-to-end and hierarchical models on compositional tasks, pointing to directions for models better suited to real-world long-horizon manipulation.
📝 Abstract
Enabling robots to flexibly schedule and compose learned skills for novel long-horizon manipulation under diverse perturbations remains a core challenge. Early explorations with end-to-end VLA models show limited success, as these models struggle to generalize beyond the training distribution. Hierarchical approaches, where high-level planners generate subgoals for low-level policies, bring some improvement but still suffer under complex perturbations, revealing limited capability in skill composition. Moreover, existing benchmarks primarily emphasize task completion in long-horizon settings, offering little insight into compositional generalization, robustness, and the interplay between planning and execution. To systematically investigate these gaps, we propose RoboHiMan, a hierarchical evaluation paradigm for compositional generalization in long-horizon manipulation. RoboHiMan introduces HiMan-Bench, a benchmark of atomic and compositional tasks under diverse perturbations, supported by a multi-level training dataset for analyzing progressive data scaling, and proposes three evaluation paradigms (vanilla, decoupled, coupled) that probe the necessity of skill composition and reveal bottlenecks in hierarchical architectures. Experiments highlight clear capability gaps across representative models and architectures, pointing to directions for developing models better suited to real-world long-horizon manipulation tasks. Videos and open-source code can be found on our project website: https://chenyt31.github.io/robo-himan.github.io/.