Egocentric Instruction-oriented Affordance Prediction via Large Multimodal Model

Published: August 25, 2025 | arXiv ID: 2508.17922v1

By: Bokai Ji, Jie Gu, Xiaokang Ma and more

Potential Business Impact:

Enables robots to manipulate objects according to natural-language instructions

Business Areas:
Robotics Hardware, Science and Engineering, Software

Affordance is crucial for intelligent robots in the context of object manipulation. In this paper, we argue that affordance should be task- and instruction-dependent, a point overlooked by many previous works: different instructions can lead to different manipulation regions and directions even for the same object. Based on this observation, we present a new dataset comprising fifteen thousand object-instruction-affordance triplets. All scenes in the dataset are captured from an egocentric viewpoint, designed to approximate the perspective of a human-like robot. Furthermore, we investigate how to enable large multimodal models (LMMs) to serve as affordance predictors by implementing a "search against verifiers" pipeline: the LMM progressively predicts affordances, and the output at each step is verified by the model itself during the iterative process, imitating a reasoning process. Experiments show that our method not only unlocks new instruction-oriented affordance prediction capabilities but also achieves outstanding performance across the board.
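The "search against verifiers" loop described in the abstract (propose an affordance, have the same model verify it, repeat) can be illustrated with a minimal sketch. The `lmm` client below, its `propose_affordance` and `score_affordance` methods, and the scoring scheme are hypothetical placeholders for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Affordance:
    region: Tuple[float, float]      # e.g., a 2D point in the egocentric image
    direction: Tuple[float, float]   # e.g., a manipulation direction in the image plane
    score: float = 0.0               # self-verification score in [0, 1]

def search_against_verifiers(lmm, image, instruction: str, steps: int = 5) -> Optional[Affordance]:
    """Iteratively propose and self-verify affordances, keeping the best candidate.

    `lmm` is a hypothetical multimodal client assumed to expose:
      - propose_affordance(image, instruction, history) -> Affordance
      - score_affordance(image, instruction, candidate) -> float in [0, 1]
    """
    best: Optional[Affordance] = None
    history: List[Affordance] = []
    for _ in range(steps):
        # Search step: the LMM proposes a new candidate, conditioned on prior attempts.
        candidate = lmm.propose_affordance(image, instruction, history)
        # Verification step: the same LMM scores its own proposal.
        candidate.score = lmm.score_affordance(image, instruction, candidate)
        history.append(candidate)
        if best is None or candidate.score > best.score:
            best = candidate
    return best
```

In this reading, "search" corresponds to the repeated proposals and "against verifiers" to the model scoring each proposal itself; the highest-scoring candidate is returned as the final prediction.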

Country of Origin
🇨🇳 China

Page Count
16 pages

Category
Computer Science:
Robotics