SAGA: Open-World Mobile Manipulation via Structured Affordance Grounding

📅 2025-12-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address insufficient cross-environment, cross-task, and cross-instruction-form generalization in open-world mobile manipulation, this paper proposes a Structured Affordance Grounding framework. It unifies multimodal instructions—including language, point-and-click inputs, and demonstrations—into 3D affordance heatmaps, thereby decoupling high-level semantic intent from low-level visual-motor control. The framework integrates multimodal foundation models, 3D vision–affordance alignment, conditional policy learning, and whole-body motion control. It achieves, for the first time, zero-shot cross-scene execution and few-shot adaptation, overcoming generalization bottlenecks inherent in both end-to-end and conventional modular approaches. Evaluated on 11 real-world tasks, it significantly outperforms state-of-the-art baselines, demonstrating superior generalization capability and robustness.

📝 Abstract
We present SAGA, a versatile and adaptive framework for visuomotor control that can generalize across various environments, task objectives, and user specifications. To efficiently learn such capability, our key idea is to disentangle high-level semantic intent from low-level visuomotor control by explicitly grounding task objectives in the observed environment. Using an affordance-based task representation, we express diverse and complex behaviors in a unified, structured form. By leveraging multimodal foundation models, SAGA grounds the proposed task representation to the robot's visual observation as 3D affordance heatmaps, highlighting task-relevant entities while abstracting away spurious appearance variations that would hinder generalization. These grounded affordances enable us to effectively train a conditional policy on multi-task demonstration data for whole-body control. In a unified framework, SAGA can solve tasks specified in different forms, including language instructions, selected points, and example demonstrations, enabling both zero-shot execution and few-shot adaptation. We instantiate SAGA on a quadrupedal manipulator and conduct extensive experiments across eleven real-world tasks. SAGA consistently outperforms end-to-end and modular baselines by substantial margins. Together, these results demonstrate that structured affordance grounding offers a scalable and effective pathway toward generalist mobile manipulation.
Problem

Research questions and friction points this paper is trying to address.

Generalizing visuomotor control across diverse environments and task objectives
Grounding high-level semantic intent in 3D affordances for robust generalization
Supporting zero-shot execution and few-shot adaptation from multimodal task specifications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Structured affordance grounding for visuomotor control
Multimodal models generate 3D affordance heatmaps from observations
Unified framework handles language instructions, selected points, and demonstrations for zero-shot execution and few-shot adaptation
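The pipeline described above — grounding a task specification into a 3D affordance heatmap over the observed scene, then conditioning a control policy on that heatmap — can be illustrated with a toy sketch. The functions below are hypothetical stand-ins, not the paper's implementation: a Gaussian proximity score replaces the foundation-model grounding step, and an affordance-weighted centroid replaces the learned whole-body policy.

```python
import numpy as np

def ground_to_affordance_heatmap(points, target_center, scale=0.1):
    """Toy grounding step: score each 3D point by proximity to a grounded
    target location (a stand-in for multimodal-foundation-model grounding)."""
    d2 = np.sum((points - target_center) ** 2, axis=1)
    heat = np.exp(-d2 / (2 * scale ** 2))
    return heat / (heat.max() + 1e-8)  # normalize to [0, 1] per scene

def conditional_policy(points, heatmap):
    """Toy conditional policy: return the affordance-weighted centroid as
    a 3D end-effector goal (a stand-in for the learned whole-body policy)."""
    w = heatmap / heatmap.sum()
    return (points * w[:, None]).sum(axis=0)

# A synthetic scene: 512 points standing in for a fused 3D observation.
rng = np.random.default_rng(0)
points = rng.uniform(-1.0, 1.0, size=(512, 3))
target = np.array([0.3, -0.2, 0.5])  # e.g. a grounded "handle" location

heat = ground_to_affordance_heatmap(points, target)
action = conditional_policy(points, heat)
```

Note how any instruction form (language, a clicked point, or a demonstration) only needs to produce the heatmap; the policy interface downstream is unchanged, which is the decoupling the summary describes.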