Exo2Ego: Exocentric Knowledge Guided MLLM for Egocentric Video Understanding
By: Haoyu Zhang, Qiaohui Chu, Meng Liu and more
Potential Business Impact:
Helps robots understand what you see.
AI personal assistants, deployed through robots or wearables, require embodied understanding to collaborate effectively with humans. Current Multimodal Large Language Models (MLLMs) primarily focus on third-person (exocentric) vision, overlooking the unique aspects of first-person (egocentric) videos. Additionally, the high cost of acquiring egocentric data limits dataset size, impairing MLLM performance. To address these challenges, we propose learning the mapping between exocentric and egocentric domains, leveraging the extensive exocentric knowledge within existing MLLMs to enhance egocentric video understanding. To this end, we introduce Ego-ExoClip, a pre-training dataset comprising 1.1M synchronized ego-exo clip-text pairs derived from Ego-Exo4D. Our approach features a progressive training pipeline with three stages: Teacher Self-Preparation, Teacher-Student Guidance, and Student Self-Practice. We also propose EgoIT, an instruction-tuning dataset drawn from multiple sources to strengthen the model's instruction-following capabilities, along with EgoBench, a benchmark comprising eight different tasks for thorough evaluation. Extensive experiments across diverse egocentric tasks reveal that existing MLLMs perform inadequately in egocentric video understanding, while our model significantly outperforms these leading models.
Similar Papers
Advancing Egocentric Video Question Answering with Multimodal Large Language Models
CV and Pattern Recognition
Helps computers understand videos from a person's eyes.
EgoX: Egocentric Video Generation from a Single Exocentric Video
CV and Pattern Recognition
Turns normal videos into your own first-person view.
EgoExo-Gen: Ego-centric Video Prediction by Watching Exo-centric Videos
CV and Pattern Recognition
Makes videos show what your hands are doing.