Affordance2Action: Task-Conditioned Scene-level Affordance Grounding for Real-Time Manipulation

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing approaches struggle to localize functional regions in cluttered real-world scenes according to task instructions, and existing benchmarks do not support complex mappings such as one-to-many task-to-region correspondences. This work proposes a task-conditioned, scene-level affordance grounding framework and introduces the first real-scene affordance benchmark covering both single-region and multi-region instruction mappings. To make annotation efficient and high-quality, the A2A-AffordGen pipeline combines large language model filtering, interactive segmentation, mask refinement, and human verification. Models trained on this dataset outperform baseline methods (general-purpose segmentation models, vision-language models, and affordance distillation approaches) in task-level localization accuracy and provide useful spatial priors for downstream manipulation tasks.

📝 Abstract

Task-conditioned manipulation requires grounding instructions to task-relevant functional parts rather than object categories. This setting is scene-dependent and often one-to-many in cluttered scenes: the same object may afford different interactions across tasks, while a single task may correspond to either one functional region or multiple valid functional regions, depending on the scene layout. Existing affordance datasets and benchmarks remain misaligned with this setting, as they typically focus on grasping or object-level affordances, rely on synthetic scenes, or assume a single instruction-region correspondence. We present Affordance2Action (A2A), a benchmark-centered learning framework for scene-level, task-conditioned part affordance grounding. At its core is A2A-Bench, a manipulation-oriented benchmark that covers both single-region and multi-region instruction correspondences in everyday scenes, with the latter highlighting the ambiguity and diversity of affordance grounding in realistic multi-object environments. To construct it at scale, we build A2A-AffordGen, an agent-assisted annotation pipeline that combines language-model filtering, interactive part segmentation, instance-level mask-out refinement, task-reasoning instruction generation, and human verification. A2A-Bench's supervision further supports diverse downstream applications, with real-time affordance grounding and affordance-conditioned manipulation policies as two representative examples. Experiments show that A2A exposes substantial gaps in generic segmentation, VLM-based grounding, and affordance distillation baselines, while improving task-level localization and providing useful spatial priors for downstream manipulation. All datasets and code will be publicly released to promote open research.
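To make the one-to-many setting described in the abstract concrete, the following minimal Python sketch shows one way a single instruction could be paired with several valid functional regions, and how a prediction could be scored against whichever valid region it matches best. Everything here is an assumption for illustration: the names `AffordanceSample`, `best_match_iou`, and `box`, the pixel-set mask representation, and the best-match IoU scoring rule are hypothetical and do not describe the actual A2A-Bench data format or evaluation metric.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Tuple

Pixel = Tuple[int, int]  # (row, col); hypothetical mask representation


@dataclass
class AffordanceSample:
    """Hypothetical record: one instruction with one or more valid regions."""
    instruction: str
    valid_regions: List[FrozenSet[Pixel]] = field(default_factory=list)

    @property
    def is_multi_region(self) -> bool:
        # More than one valid region means a one-to-many correspondence.
        return len(self.valid_regions) > 1


def iou(a: FrozenSet[Pixel], b: FrozenSet[Pixel]) -> float:
    """Intersection over union of two pixel sets (0.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def best_match_iou(pred: FrozenSet[Pixel], sample: AffordanceSample) -> float:
    """Assumed scoring rule: credit the prediction for its best valid region."""
    return max((iou(pred, region) for region in sample.valid_regions), default=0.0)


def box(r0: int, c0: int, r1: int, c1: int) -> FrozenSet[Pixel]:
    """Rectangular mask covering rows r0..r1-1 and columns c0..c1-1."""
    return frozenset((r, c) for r in range(r0, r1) for c in range(c0, c1))


# A scene with two drawer handles: either one is a valid answer.
sample = AffordanceSample(
    instruction="open a drawer",
    valid_regions=[box(0, 0, 2, 4), box(5, 0, 7, 4)],
)

# The prediction covers 6 of the 8 pixels of the second handle.
pred = box(5, 0, 7, 3)
score = best_match_iou(pred, sample)
print(sample.is_multi_region, score)
```

Under this assumed rule, a model is not penalized for choosing one valid region over another, which is one simple way to respect the ambiguity of multi-region instructions; a benchmark could equally require recovering all valid regions.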

Problem

Research questions and friction points this paper is trying to address.

affordance grounding

task-conditioned manipulation

scene-level affordance

functional regions

instruction-region correspondence

Innovation

Methods, ideas, or system contributions that make the work stand out.

task-conditioned affordance

scene-level grounding

multi-region correspondence