Egocentric Instruction-oriented Affordance Prediction via Large Multimodal Model
By: Bokai Ji, Jie Gu, Xiaokang Ma and more
Potential Business Impact:
Lets robots handle objects based on instructions.
Affordance is crucial for intelligent robots in the context of object manipulation. In this paper, we argue that affordance should be task-/instruction-dependent, a factor overlooked by many previous works: different instructions can lead to different manipulation regions and directions even for the same object. Based on this observation, we present a new dataset comprising fifteen thousand object-instruction-affordance triplets. All scenes in the dataset are captured from an egocentric viewpoint, designed to approximate the perspective of a human-like robot. Furthermore, we investigate how to enable large multimodal models (LMMs) to serve as affordance predictors by implementing a "search against verifiers" pipeline: the LMM progressively predicts affordances, and the output at each step is verified by the model itself during the iterative process, imitating a reasoning process. Experiments show that our method not only unlocks new instruction-oriented affordance prediction capabilities but also achieves strong performance more broadly.
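To make the "search against verifiers" idea concrete, here is a minimal sketch of the propose-then-verify loop as described in the abstract. The interfaces `lmm_propose` and `lmm_verify`, the `Affordance` fields, and the step limit are all hypothetical placeholders, not the authors' actual API or implementation.

```python
# Minimal sketch of a "search against verifiers" loop (assumed interfaces).
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Affordance:
    region: List[float]     # e.g. a manipulation-region proxy such as [x1, y1, x2, y2]
    direction: List[float]  # e.g. a manipulation direction vector [dx, dy, dz]


def lmm_propose(image, instruction: str, history: List[Affordance]) -> Affordance:
    """Ask the LMM for the next affordance candidate, conditioned on prior attempts."""
    raise NotImplementedError  # placeholder: call your LMM here


def lmm_verify(image, instruction: str, candidate: Affordance) -> bool:
    """Ask the same LMM to judge whether the candidate satisfies the instruction."""
    raise NotImplementedError  # placeholder: call your LMM here


def search_against_verifier(image, instruction: str, max_steps: int = 5) -> Optional[Affordance]:
    """Iteratively propose affordances and keep only a self-verified prediction."""
    history: List[Affordance] = []
    for _ in range(max_steps):
        candidate = lmm_propose(image, instruction, history)
        if lmm_verify(image, instruction, candidate):
            return candidate          # verified affordance for this instruction
        history.append(candidate)     # keep rejected candidates as context for the next step
    return None                       # no candidate passed verification within the budget
```

The key design choice in this sketch is that the same model acts as both proposer and verifier, so the loop imitates an iterative reasoning process without a separate reward model.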
Similar Papers
RoboAfford++: A Generative AI-Enhanced Dataset for Multimodal Affordance Learning in Robotic Manipulation and Navigation
Robotics
Helps robots understand how to grab and move things.
AffordBot: 3D Fine-grained Embodied Reasoning via Multimodal Large Language Models
CV and Pattern Recognition
Helps robots understand how to use objects.
The Wilhelm Tell Dataset of Affordance Demonstrations
Robotics
Robots learn to do chores by watching videos.