RoboAfford++: A Generative AI-Enhanced Dataset for Multimodal Affordance Learning in Robotic Manipulation and Navigation

📅 2025-11-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models (VLMs) exhibit limited capability in functional reasoning, such as identifying grasp points, localizing placement regions, and detecting traversable space, largely because their training data lack fine-grained object and spatial affordance annotations. To address this, the paper introduces RoboAfford++, the first million-scale, robot-centric vision-language question-answering dataset explicitly designed for manipulation and navigation, covering three tasks: object affordance recognition, object affordance prediction, and spatial affordance localization. A companion benchmark, RoboAfford-Eval, is proposed for evaluating affordance-aware prediction in real-world scenarios. Annotations are produced with the aid of generative AI, and VLMs are fine-tuned on the dataset via multi-task question answering. Experiments show that state-of-the-art VLMs remain bottlenecked on these functional reasoning tasks, while fine-tuning on RoboAfford++ yields an average performance gain of 23.6% across all three tasks, empirically validating the role of fine-grained affordance data in improving VLMs' functional understanding.
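
To make the three task categories concrete, a hypothetical record layout is sketched below. The public release format is not shown on this page, so every field name, task identifier, image path, and coordinate value here is an assumption for illustration only.

```python
# Hypothetical sketch of RoboAfford++-style QA records; the real schema is not
# given on this page, so all field names, paths, and values are assumptions.
from typing import TypedDict

class AffordanceQA(TypedDict):
    image: str       # path to an RGB frame
    task: str        # one of the three task categories described above
    question: str    # natural-language query posed to the VLM
    answer: list     # grounded answer, e.g. 2D points or a bounding box

examples: list[AffordanceQA] = [
    {   # object affordance recognition: locate a target by attribute or relation
        "image": "scenes/kitchen_0001.jpg",
        "task": "object_affordance_recognition",
        "question": "Which object on the counter can be used to cut the bread?",
        "answer": [[412, 233, 498, 310]],   # bounding box [x1, y1, x2, y2]
    },
    {   # object affordance prediction: point at the functional part to grasp
        "image": "scenes/kitchen_0001.jpg",
        "task": "object_affordance_prediction",
        "question": "Point to the part of the knife a robot should grasp.",
        "answer": [[430, 300]],             # 2D point on the handle
    },
    {   # spatial affordance localization: free space for placement or navigation
        "image": "scenes/kitchen_0001.jpg",
        "task": "spatial_affordance_localization",
        "question": "Where on the table can the mug be placed without collision?",
        "answer": [[615, 270]],             # 2D point in free space
    },
]
```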

📝 Abstract
Robotic manipulation and navigation are fundamental capabilities of embodied intelligence, enabling effective robot interactions with the physical world. Achieving these capabilities requires a cohesive understanding of the environment, including object recognition to localize target objects, object affordances to identify potential interaction areas, and spatial affordances to discern optimal areas for both object placement and robot movement. While Vision-Language Models (VLMs) excel at high-level task planning and scene understanding, they often struggle to infer actionable positions for physical interaction, such as functional grasping points and permissible placement regions. This limitation stems from the lack of fine-grained annotations for object and spatial affordances in their training datasets. To tackle this challenge, we introduce RoboAfford++, a generative AI-enhanced dataset for multimodal affordance learning in both robotic manipulation and navigation. Our dataset comprises 869,987 images paired with 2.0 million question-answering (QA) annotations, covering three critical tasks: object affordance recognition to identify target objects based on attributes and spatial relationships, object affordance prediction to pinpoint functional parts for manipulation, and spatial affordance localization to identify free space for object placement and robot navigation. Complementing this dataset, we propose RoboAfford-Eval, a comprehensive benchmark for assessing affordance-aware prediction in real-world scenarios, featuring 338 meticulously annotated samples across the same three tasks. Extensive experimental results reveal the deficiencies of existing VLMs in affordance learning, while fine-tuning on the RoboAfford++ dataset significantly enhances their ability to reason about object and spatial affordances, validating the dataset's effectiveness.
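
The page does not state how RoboAfford-Eval scores predictions on its 338 samples. A common convention for point-style affordance outputs is to count a prediction as correct when it falls inside the annotated ground-truth region; the sketch below assumes that convention, binary ground-truth masks, and a single predicted point per sample, none of which are confirmed by the source.

```python
# Minimal sketch of a point-in-region success metric (assumed, not the paper's
# documented protocol): ground truth is a binary mask and the model emits one
# 2D point per sample.
import numpy as np

def point_in_region_accuracy(pred_points, gt_masks):
    """pred_points: list of (x, y) pixel coordinates; gt_masks: list of HxW bool arrays."""
    hits = 0
    for (x, y), mask in zip(pred_points, gt_masks):
        h, w = mask.shape
        xi, yi = int(round(x)), int(round(y))
        if 0 <= xi < w and 0 <= yi < h and mask[yi, xi]:
            hits += 1
    return hits / max(len(pred_points), 1)

# Example: one prediction lands inside the annotated region, one lands outside.
mask = np.zeros((480, 640), dtype=bool)
mask[200:300, 400:500] = True                  # annotated free-space region
print(point_in_region_accuracy([(450, 250), (10, 10)], [mask, mask]))  # 0.5
```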
Problem

Research questions and friction points this paper is trying to address.

Addresses the limitations of Vision-Language Models in robotic manipulation and navigation
Overcomes the lack of fine-grained object and spatial affordance annotations in training datasets
Enables precise localization of functional interaction points and free spaces
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative AI-enhanced dataset for multimodal affordance learning
Comprehensive benchmark for assessing affordance-aware prediction
Fine-tuning on RoboAfford++ enhances VLMs' reasoning about object and spatial affordances (see the fine-tuning sketch below)
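
The fine-tuning recipe is not detailed on this page. The sketch below shows one plausible way to turn a QA record like the ones illustrated earlier into a chat-style supervised fine-tuning sample for a VLM; the message layout, the <image> placeholder, and the answer serialization are assumptions, not the paper's format.

```python
import json

def to_chat_sample(record):
    """Convert a hypothetical RoboAfford++ QA record into a generic chat-format
    supervised fine-tuning sample (message schema assumed, not from the paper)."""
    return {
        "images": [record["image"]],
        "messages": [
            {"role": "user", "content": "<image>\n" + record["question"]},
            {"role": "assistant", "content": json.dumps(record["answer"])},
        ],
    }

# Usage with one of the hypothetical records sketched earlier:
# sft_sample = to_chat_sample(examples[1])
```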