Continual Cross-Modal Generalization
By: Yan Xia, Hai Huang, Minghui Fang, and more
Potential Business Impact:
Lets computers learn from pictures, sound, and words together.
Cross-modal generalization aims to learn a shared discrete representation space from multimodal pairs, enabling knowledge transfer across unannotated modalities. However, achieving a unified representation for all modality pairs requires extensive paired data, which is often impractical. Inspired by the availability of abundant bimodal data (e.g., as used in ImageBind), we explore a continual learning approach that incrementally maps new modalities into a shared discrete codebook via a mediator modality. We propose the Continual Mixture of Experts Adapter (CMoE-Adapter) to project diverse modalities into a unified space while preserving prior knowledge. To align semantics across stages, we introduce a Pseudo-Modality Replay (PMR) mechanism with a dynamically expanding codebook, enabling the model to adaptively incorporate new modalities using learned ones as guidance. Extensive experiments on image-text, audio-text, video-text, and speech-text pairs show that our method achieves strong performance on various cross-modal generalization tasks. Code is provided in the supplementary material.
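The abstract names two core mechanisms: a mixture-of-experts adapter that projects each modality into a shared space, and quantization of those projections into a discrete, expandable codebook. The authors' actual implementation lives in their supplementary material; the sketch below is only a minimal, hypothetical PyTorch illustration of those two ideas, and every name in it (MoEAdapter, quantize, num_experts, and so on) is assumed for illustration rather than taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEAdapter(nn.Module):
    """Minimal mixture-of-experts adapter (illustrative, not the paper's
    CMoE-Adapter): a gating network softly combines per-expert linear
    projections of a modality-specific feature into one shared space."""
    def __init__(self, in_dim: int, out_dim: int, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Linear(in_dim, out_dim) for _ in range(num_experts)
        )
        self.gate = nn.Linear(in_dim, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_dim) modality-specific features
        weights = F.softmax(self.gate(x), dim=-1)                 # (batch, E)
        outs = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, E, out_dim)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)          # (batch, out_dim)

def quantize(z: torch.Tensor, codebook: torch.Tensor):
    """VQ-style nearest-neighbour lookup: map continuous embeddings
    z (batch, D) to indices into a discrete codebook (K, D)."""
    idx = torch.cdist(z, codebook).argmin(dim=-1)  # (batch,)
    return idx, codebook[idx]

# Toy usage: project features from one modality, then snap them to
# entries of the shared discrete codebook.
adapter = MoEAdapter(in_dim=512, out_dim=256)
codebook = torch.randn(1024, 256)  # shared codebook, hypothetical size
codes, quantized = quantize(adapter(torch.randn(8, 512)), codebook)
```

In this simplified view, dynamic codebook expansion would amount to concatenating new rows onto the codebook tensor as a new modality is added, and pseudo-modality replay would feed stored codes from earlier stages back through training to keep the shared semantics aligned.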
Similar Papers
Continual Learning for Multiple Modalities
CV and Pattern Recognition
Teaches computers to learn new things without forgetting.
Mitigating Intra- and Inter-modal Forgetting in Continual Learning of Unified Multimodal Models
Machine Learning (CS)
Teaches AI to learn new things without forgetting old ones.
Taming Modality Entanglement in Continual Audio-Visual Segmentation
Multimedia
Helps computers learn new sounds and sights.