Egocentric Instruction-oriented Affordance Prediction via Large Multimodal Model

📅 2025-08-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing object-manipulation research overlooks a critical issue: functional affordances vary dynamically with task instructions. Method: We propose an instruction-guided affordance prediction paradigm, defining instruction-dependent, fine-grained manipulability (task-specific interaction regions and directions), and introduce the first egocentric (first-person-view) dataset comprising 15,000 object–instruction–affordance triplets. Our approach employs an iterative "search against verifiers" reasoning pipeline that combines a large multimodal model's vision-language understanding with its self-verification capability to generate semantically consistent affordance maps. Contribution/Results: On our new benchmark, our method significantly outperforms prior approaches, substantially improving both instruction-comprehension accuracy and spatial-localization robustness across diverse commands. This work establishes an interpretable and generalizable foundation for manipulation-aware perception in embodied intelligence.

📝 Abstract
Affordance is crucial for intelligent robots in the context of object manipulation. In this paper, we argue that affordance should be task-/instruction-dependent, which is overlooked by many previous works. That is, different instructions can lead to different manipulation regions and directions even for the same object. According to this observation, we present a new dataset comprising fifteen thousand object-instruction-affordance triplets. All scenes in the dataset are from an egocentric viewpoint, designed to approximate the perspective of a human-like robot. Furthermore, we investigate how to enable large multimodal models (LMMs) to serve as affordance predictors by implementing a "search against verifiers" pipeline. An LMM is asked to progressively predict affordances, with the output at each step being verified by itself during the iterative process, imitating a reasoning process. Experiments show that our method not only unlocks new instruction-oriented affordance prediction capabilities, but also achieves outstanding performance broadly.
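The "search against verifiers" loop described in the abstract can be sketched as a propose-verify cycle: the model emits an affordance candidate, judges its own output, and refines on rejection. The sketch below is illustrative only; the `propose`/`verify` callables and the toy region-matching verifier stand in for the paper's LMM calls, which are not specified here.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    accepted: bool
    feedback: str = ""

def search_against_verifiers(propose: Callable, verify: Callable, max_steps: int = 5):
    """Iteratively propose affordance candidates until the verifier accepts one.

    `propose` maps a history of (candidate, feedback) pairs to a new candidate;
    `verify` returns a Verdict. Both would be backed by the same LMM in the
    paper's setting (self-verification); here they are plain functions.
    """
    history = []
    candidate = propose(history)
    for _ in range(max_steps):
        verdict = verify(candidate)
        if verdict.accepted:
            return candidate, len(history) + 1  # accepted candidate, steps used
        history.append((candidate, verdict.feedback))
        candidate = propose(history)  # refine using the rejection feedback
    return candidate, max_steps  # best effort once the step budget runs out

# Toy stand-ins: the "proposer" nudges a 2-D region guess toward a target,
# and the "verifier" accepts only an exact match (a stand-in for checking
# semantic consistency of the predicted interaction region).
target = (40, 60)  # hypothetical interaction-region center

def propose(history):
    if not history:
        return (10, 10)
    x, y = history[-1][0]
    return (min(x + 15, target[0]), min(y + 25, target[1]))

def verify(candidate):
    return Verdict(candidate == target, feedback="shift toward the handle")

region, steps = search_against_verifiers(propose, verify)
# → region == (40, 60) after 3 propose-verify steps
```

The loop terminates either on acceptance or on exhausting the step budget, which mirrors the bounded iterative reasoning the abstract describes.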
Problem

Research questions and friction points this paper is trying to address.

Predicting task-dependent affordances for object manipulation
Enabling large multimodal models as instruction-oriented affordance predictors
Addressing egocentric viewpoint challenges in robotic manipulation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Task-dependent affordance prediction via LMMs
Egocentric dataset with instruction-affordance triplets
Iterative self-verification reasoning pipeline
Bokai Ji, Xidian University
Jie Gu, Rightly Robotics
Xiaokang Ma, Rightly Robotics
Chu Tang, Rightly Robotics
Jingmin Chen, Alibaba Group
Guangxia Li, Xidian University