MultiModal Action Conditioned Video Generation
By: Yichen Li, Antonio Torralba
Potential Business Impact:
Robots learn to do delicate tasks with touch.
Current video models fall short as world models because they lack fine-grained control. General-purpose household robots require real-time fine motor control to handle delicate tasks and urgent situations. In this work, we introduce fine-grained multimodal actions to capture such precise control. We consider senses of proprioception, kinesthesia, force haptics, and muscle activation. Such multimodal senses naturally enable fine-grained interactions that are difficult to simulate with text-conditioned generative models. To effectively simulate fine-grained multisensory actions, we develop a feature learning paradigm that aligns these modalities while preserving the unique information each modality provides. We further propose a regularization scheme to enhance the causality of the action trajectory features in representing intricate interaction dynamics. Experiments show that incorporating multimodal senses improves simulation accuracy and reduces temporal drift. Extensive ablation studies and downstream applications demonstrate the effectiveness and practicality of our work.
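A minimal sketch of what such a multimodal alignment objective might look like, assuming a contrastive (InfoNCE-style) term between per-modality encoders; the encoder architectures, dimensions, and loss form below are illustrative assumptions, not the paper's actual method.

```python
# Illustrative sketch (assumed, not the paper's implementation): align
# per-modality action features with a contrastive loss while keeping
# modality-specific information in separate encoders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    """Hypothetical per-modality encoder (e.g., proprioception, force haptics)."""
    def __init__(self, in_dim: int, feat_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim)
        )

    def forward(self, x):
        return self.net(x)

def alignment_loss(feat_a, feat_b, temperature: float = 0.07):
    """Symmetric InfoNCE-style alignment between two modalities' features."""
    a = F.normalize(feat_a, dim=-1)
    b = F.normalize(feat_b, dim=-1)
    logits = a @ b.t() / temperature              # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Toy usage with random proprioception / haptic readings for a batch of 8 steps.
proprio_enc, haptic_enc = ModalityEncoder(in_dim=32), ModalityEncoder(in_dim=16)
proprio, haptics = torch.randn(8, 32), torch.randn(8, 16)
loss = alignment_loss(proprio_enc(proprio), haptic_enc(haptics))
loss.backward()
```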
Similar Papers
Modality-Augmented Fine-Tuning of Foundation Robot Policies for Cross-Embodiment Manipulation on GR1 and G1
Robotics
Robots learn to do new tasks better.
ActionArt: Advancing Multimodal Large Models for Fine-Grained Human-Centric Video Understanding
CV and Pattern Recognition
Teaches computers to see every body part move.
Video Generation Models in Robotics - Applications, Research Challenges, Future Directions
Systems and Control
Makes robots learn and act by watching videos.