Investigating and Enhancing Vision-Audio Capability in Omnimodal Large Language Models
By: Rui Hu, Delai Qiu, Shuyu Wei, and more
Potential Business Impact:
Teaches computers to understand pictures when asked with spoken questions.
Omnimodal Large Language Models (OLLMs) have shown significant progress in integrating vision and text, but still struggle with integrating vision and audio, often exhibiting suboptimal performance when processing audio queries compared to text queries. This disparity is primarily due to insufficient alignment between the vision and audio modalities during training, which leads to inadequate attention to visual information when the model is queried through audio. To mitigate this issue, we propose a Self-Knowledge Distillation (Self-KD) training method in which the vision-text component of the OLLM serves as the teacher and the vision-audio component as the student. This enables the model to process audio in a manner analogous to its text processing. Our experimental results demonstrate that Self-KD effectively enhances the vision-audio capabilities of OLLMs by learning from the vision-text components, which in turn improves the interaction between audio and images and yields better performance on multimodal tasks.
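The abstract does not give implementation details, but the teacher-student setup it describes can be illustrated with a small sketch. Below is a hypothetical PyTorch-style training step, assuming the OLLM exposes a single forward call that accepts an image plus either a text or an audio query; the `ollm(image=..., text=...)` / `ollm(image=..., audio=...)` interface, the `alpha` weighting, and the `temperature` value are illustrative assumptions, not details from the paper. The vision-text pass produces soft targets, and the vision-audio pass is trained to match them with a KL-divergence term alongside the usual task loss.

```python
import torch
import torch.nn.functional as F

def self_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the vision-text (teacher) and the
    vision-audio (student) output distributions, softened by a temperature."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # Scale by t^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)

def training_step(ollm, image, text_query, audio_query, labels, alpha=0.5):
    # Teacher pass: the same image paired with the text form of the query.
    # No gradients flow here; the vision-text pathway only supplies soft targets.
    with torch.no_grad():
        teacher_logits = ollm(image=image, text=text_query).logits

    # Student pass: the same image paired with the spoken form of the query.
    student_out = ollm(image=image, audio=audio_query, labels=labels)

    # Combine the ordinary task loss with the self-distillation term.
    loss = alpha * student_out.loss + (1 - alpha) * self_kd_loss(
        student_out.logits, teacher_logits
    )
    return loss
```

In this sketch the teacher and student share the same backbone (hence "self" distillation); only the query modality differs between the two forward passes.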
Similar Papers
Bridging Ears and Eyes: Analyzing Audio and Visual Large Language Models to Humans in Visible Sound Recognition and Reducing Their Sensory Gap via Cross-Modal Distillation
Sound
Teaches AI to hear and see better.
Aligned Better, Listen Better for Audio-Visual Large Language Models
CV and Pattern Recognition
Helps computers understand videos by listening.
EM-KD: Distilling Efficient Multimodal Large Language Model with Unbalanced Vision Tokens
CV and Pattern Recognition
Makes AI understand pictures better without using more power.