🤖 AI Summary
Existing object-manipulation research overlooks a critical issue: functional affordances vary dynamically with task instructions. Method: We propose an instruction-guided manipulability prediction paradigm, defining instruction-dependent, fine-grained manipulability (i.e., task-specific interaction regions and directions), and introduce the first egocentric (first-person-view) dataset comprising 15,000 object–instruction–manipulability triplets. Our approach employs an iterative "search against verifiers" reasoning pipeline that combines large multimodal models' vision-language understanding with their self-verification capabilities to generate semantically consistent manipulability maps. Contribution/Results: On our new benchmark, the method significantly outperforms prior approaches, improving both instruction-comprehension accuracy and spatial-localization robustness across diverse commands. This work establishes an interpretable and generalizable foundation for manipulation-aware perception in embodied intelligence.
📝 Abstract
Affordance is crucial for intelligent robots in the context of object manipulation. In this paper, we argue that affordance should be task- and instruction-dependent, a point overlooked by many previous works: different instructions can lead to different manipulation regions and directions even for the same object. Based on this observation, we present a new dataset comprising fifteen thousand object-instruction-affordance triplets. All scenes in the dataset are captured from an egocentric viewpoint, designed to approximate the perspective of a human-like robot. Furthermore, we investigate how to enable large multimodal models (LMMs) to serve as affordance predictors by implementing a "search against verifiers" pipeline: an LMM progressively predicts affordances, and the output at each step is verified by the model itself during the iterative process, imitating a reasoning process. Experiments show that our method not only unlocks new instruction-oriented affordance prediction capabilities, but also achieves strong performance across the board.
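The iterative predict-then-verify loop described above can be sketched as follows. This is a minimal illustration of the control flow only, assuming hypothetical stub functions (`propose_affordance`, `verify_affordance`) in place of actual LMM queries; the names and data layout are illustrative, not the paper's API.

```python
# Sketch of a "search against verifiers" loop: a model proposes
# affordance candidates step by step, and the same model verifies
# each proposal before it is accepted. Both functions below are
# illustrative stubs, not the paper's actual implementation.

def propose_affordance(instruction, image, history):
    """Stub proposer: returns a candidate (region, direction) pair.
    A real system would query the LMM here, conditioned on the
    instruction, the image, and previously accepted predictions."""
    step = len(history)
    return {"region": f"region_{step}", "direction": f"dir_{step}"}

def verify_affordance(instruction, image, candidate):
    """Stub self-verifier: accepts every candidate.
    A real system would ask the same LMM to check that the candidate
    is semantically consistent with the instruction."""
    return True

def search_against_verifiers(instruction, image, max_steps=3):
    """Iteratively propose affordances, keeping only verified ones."""
    accepted = []
    for _ in range(max_steps):
        candidate = propose_affordance(instruction, image, accepted)
        if verify_affordance(instruction, image, candidate):
            accepted.append(candidate)
    return accepted

result = search_against_verifiers("open the drawer", image=None)
```

The key design choice, per the abstract, is that proposer and verifier are the same model, so verification acts as a self-check at every step of the reasoning process rather than a separate trained critic.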