🤖 AI Summary
To address insufficient cross-environment, cross-task, and cross-instruction-form generalization in open-world mobile manipulation, this paper proposes a Structured Affordance Grounding framework. It unifies multimodal instructions—including language, point-and-click inputs, and demonstrations—into 3D affordance heatmaps, thereby decoupling high-level semantic intent from low-level visuomotor control. The framework integrates multimodal foundation models, 3D vision–affordance alignment, conditional policy learning, and whole-body motion control, enabling both zero-shot cross-scene execution and few-shot adaptation and overcoming generalization bottlenecks inherent in both end-to-end and conventional modular approaches. Evaluated on 11 real-world tasks, it significantly outperforms state-of-the-art baselines, demonstrating superior generalization capability and robustness.
📝 Abstract
We present SAGA, a versatile and adaptive framework for visuomotor control that can generalize across various environments, task objectives, and user specifications. To efficiently learn such capability, our key idea is to disentangle high-level semantic intent from low-level visuomotor control by explicitly grounding task objectives in the observed environment. Using an affordance-based task representation, we express diverse and complex behaviors in a unified, structured form. By leveraging multimodal foundation models, SAGA grounds the proposed task representation to the robot's visual observation as 3D affordance heatmaps, highlighting task-relevant entities while abstracting away spurious appearance variations that would hinder generalization. These grounded affordances enable us to effectively train a conditional policy on multi-task demonstration data for whole-body control. In a unified framework, SAGA can solve tasks specified in different forms, including language instructions, selected points, and example demonstrations, enabling both zero-shot execution and few-shot adaptation. We instantiate SAGA on a quadrupedal manipulator and conduct extensive experiments across eleven real-world tasks. SAGA consistently outperforms end-to-end and modular baselines by substantial margins. Together, these results demonstrate that structured affordance grounding offers a scalable and effective pathway toward generalist mobile manipulation.
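To make the pipeline concrete, the following is a minimal illustrative sketch (not the paper's implementation) of the two stages the abstract describes: grounding an instruction embedding into a 3D affordance heatmap over a point cloud, then conditioning a policy on affordance-weighted observation features. All function names, dimensions, and the linear/softmax forms are hypothetical simplifications.

```python
import numpy as np

def ground_affordance(instruction_embedding, point_cloud):
    """Hypothetical grounding step: score each 3D point by its relevance
    to the instruction, then softmax-normalize into an affordance heatmap."""
    scores = point_cloud @ instruction_embedding      # (N,) relevance logits
    scores = scores - scores.max()                    # numerical stability
    heatmap = np.exp(scores) / np.exp(scores).sum()   # distribution over points
    return heatmap

def conditional_policy(observation, heatmap, weights):
    """Hypothetical policy head: pools per-point observation features by the
    affordance heatmap, emphasizing task-relevant entities, then maps the
    pooled features to a whole-body action command."""
    pooled = heatmap @ observation                    # (D,) affordance-weighted features
    return np.tanh(weights @ pooled)                  # bounded action vector

# Toy data standing in for real perception and learned embeddings.
rng = np.random.default_rng(0)
points = rng.normal(size=(128, 3))    # 3D point cloud (N x 3)
instr = rng.normal(size=3)            # instruction embedding (toy dim 3)
obs = rng.normal(size=(128, 16))      # per-point observation features (N x D)
W = rng.normal(size=(8, 16))          # policy weights -> 8-dim action

h = ground_affordance(instr, points)
action = conditional_policy(obs, h, W)
```

The point of the sketch is the interface, not the models: any instruction modality (language, clicked points, demonstrations) that can be mapped to the same heatmap representation can drive the same downstream policy unchanged.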