Egocentric Instruction-oriented Affordance Prediction via Large Multimodal Model
By: Bokai Ji, Jie Gu, Xiaokang Ma and more
Potential Business Impact:
Lets robots handle objects based on instructions.
Affordance is crucial for intelligent robots in the context of object manipulation. In this paper, we argue that affordance should be task-/instruction-dependent, a factor overlooked by many previous works: different instructions can lead to different manipulation regions and directions even for the same object. Based on this observation, we present a new dataset comprising fifteen thousand object-instruction-affordance triplets. All scenes in the dataset are captured from an egocentric viewpoint, designed to approximate the perspective of a human-like robot. Furthermore, we investigate how to enable large multimodal models (LMMs) to serve as affordance predictors by implementing a "search against verifiers" pipeline: the LMM progressively predicts affordances, and the output at each step is verified by the model itself during the iterative process, imitating a reasoning process. Experiments show that our method not only unlocks new instruction-oriented affordance prediction capabilities but also achieves strong performance more broadly.
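To make the "search against verifiers" idea concrete, here is a minimal sketch of the propose-then-verify loop as described in the abstract. The interfaces `lmm_propose` and `lmm_verify`, the `Affordance` fields, and the step limit are all hypothetical placeholders, not the authors' actual API or implementation.

```python
# Minimal sketch of a "search against verifiers" loop (assumed interfaces).
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Affordance:
    region: List[float]     # e.g. a manipulation-region proxy such as [x1, y1, x2, y2]
    direction: List[float]  # e.g. a manipulation direction vector [dx, dy, dz]


def lmm_propose(image, instruction: str, history: List[Affordance]) -> Affordance:
    """Ask the LMM for the next affordance candidate, conditioned on prior attempts."""
    raise NotImplementedError  # placeholder: call your LMM here


def lmm_verify(image, instruction: str, candidate: Affordance) -> bool:
    """Ask the same LMM to judge whether the candidate satisfies the instruction."""
    raise NotImplementedError  # placeholder: call your LMM here


def search_against_verifier(image, instruction: str, max_steps: int = 5) -> Optional[Affordance]:
    """Iteratively propose affordances and keep only a self-verified prediction."""
    history: List[Affordance] = []
    for _ in range(max_steps):
        candidate = lmm_propose(image, instruction, history)
        if lmm_verify(image, instruction, candidate):
            return candidate          # verified affordance for this instruction
        history.append(candidate)     # keep rejected candidates as context for the next step
    return None                       # no candidate passed verification within the budget
```

The key design choice in this sketch is that the same model acts as both proposer and verifier, so the loop imitates an iterative reasoning process without a separate reward model.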
Similar Papers
RoboAfford++: A Generative AI-Enhanced Dataset for Multimodal Affordance Learning in Robotic Manipulation and Navigation
Robotics
Helps robots understand how to grab and move things.
AffordBot: 3D Fine-grained Embodied Reasoning via Multimodal Large Language Models
CV and Pattern Recognition
Helps robots understand how to use objects.
The Wilhelm Tell Dataset of Affordance Demonstrations
Robotics
Robots learn to do chores by watching videos.