🤖 AI Summary
Existing approaches struggle to localize task-relevant functional regions in cluttered real-world scenes from task instructions, and existing benchmarks do not support complex mappings such as one-to-many task-to-region correspondences. This work proposes a task-conditioned, scene-level part affordance grounding framework and introduces a real-scene affordance benchmark that covers both single-region and multi-region instruction mappings. To enable efficient, high-quality annotation, we design the A2A-AffordGen pipeline, which integrates language-model filtering, interactive part segmentation, instance-level mask-out refinement, task-reasoning instruction generation, and human verification. Models trained on this dataset improve task-level localization over baselines, including general-purpose segmentation models, vision-language models, and affordance distillation approaches, and they provide useful spatial priors for downstream manipulation tasks.
📝 Abstract
Task-conditioned manipulation requires grounding instructions to task-relevant functional parts rather than object categories. This setting is scene-dependent and often one-to-many in cluttered scenes: the same object may afford different interactions across tasks, while a single task may correspond to either one functional region or multiple valid functional regions, depending on the scene layout. Existing affordance datasets and benchmarks remain misaligned with this setting, as they typically focus on grasping or object-level affordances, rely on synthetic scenes, or assume a single instruction-region correspondence. We present Affordance2Action (A2A), a benchmark-centered learning framework for scene-level, task-conditioned part affordance grounding. At its core is A2A-Bench, a manipulation-oriented benchmark that covers both single-region and multi-region instruction correspondences in everyday scenes, with the latter highlighting the ambiguity and diversity of affordance grounding in realistic multi-object environments. To construct it at scale, we build A2A-AffordGen, an agent-assisted annotation pipeline that combines language-model filtering, interactive part segmentation, instance-level mask-out refinement, task-reasoning instruction generation, and human verification. A2A-Bench's supervision further supports diverse downstream applications, with real-time affordance grounding and affordance-conditioned manipulation policies as two representative examples. Experiments show that A2A exposes substantial gaps in generic segmentation, VLM-based grounding, and affordance distillation baselines, while improving task-level localization and providing useful spatial priors for downstream manipulation. All datasets and code will be publicly released to promote open research.