Continual Cross-Modal Generalization

Published: April 1, 2025 | arXiv ID: 2504.00561v1

By: Yan Xia, Hai Huang, Minghui Fang and more

Potential Business Impact:

Lets computers learn from pictures, sound, and words together.

Business Areas:
Motion Capture, Media and Entertainment, Video

Cross-modal generalization aims to learn a shared discrete representation space from multimodal pairs, enabling knowledge transfer across unannotated modalities. However, achieving a unified representation for all modality pairs requires extensive paired data, which is often impractical. Inspired by the availability of abundant bimodal data (e.g., in ImageBind), we explore a continual learning approach that incrementally maps new modalities into a shared discrete codebook via a mediator modality. We propose the Continual Mixture of Experts Adapter (CMoE-Adapter) to project diverse modalities into a unified space while preserving prior knowledge. To align semantics across stages, we introduce a Pseudo-Modality Replay (PMR) mechanism with a dynamically expanding codebook, enabling the model to adaptively incorporate new modalities using learned ones as guidance. Extensive experiments on image-text, audio-text, video-text, and speech-text show that our method achieves strong performance on various cross-modal generalization tasks. Code is provided in the supplementary material.
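
To make the idea of a shared discrete codebook that grows as new modalities arrive more concrete, here is a minimal sketch in plain Python. It is not the paper's CMoE-Adapter or PMR mechanism: the class name `ExpandingCodebook`, the `expand_threshold` parameter, the distance-based expansion rule, and all embedding values are assumptions chosen for illustration only.

```python
import math


class ExpandingCodebook:
    """Toy shared discrete codebook (illustrative; not the paper's method)."""

    def __init__(self, codes, expand_threshold):
        # Each code is a vector in the shared embedding space.
        self.codes = [list(code) for code in codes]
        # Hypothetical hyperparameter: maximum distance at which an
        # existing code is still considered a match.
        self.expand_threshold = expand_threshold

    def quantize(self, vector, allow_expand=False):
        """Return the index of the nearest code.

        If allow_expand is True and the nearest code is farther away than
        expand_threshold, append the vector as a new code instead.
        """
        distances = [math.dist(vector, code) for code in self.codes]
        index = min(range(len(distances)), key=distances.__getitem__)
        if allow_expand and distances[index] > self.expand_threshold:
            self.codes.append(list(vector))
            index = len(self.codes) - 1
        return index


# Earlier stage: codes assumed to come from image-text training (made-up values).
codebook = ExpandingCodebook(codes=[[0.0, 0.0], [1.0, 1.0]], expand_threshold=0.5)
text_index = codebook.quantize([0.9, 1.1])

# Later stage: an audio embedding close to a known concept reuses its code.
audio_index = codebook.quantize([1.1, 0.9], allow_expand=True)

# An audio embedding far from every existing code triggers expansion.
novel_index = codebook.quantize([3.0, -1.0], allow_expand=True)
```

In this sketch the text and audio embeddings map to the same code index, which is the property that lets a downstream model trained on one modality be applied to another, while the distant embedding adds a third code rather than overwriting an existing one.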

Country of Origin
🇨🇳 China

Page Count
15 pages

Category
Computer Science:
CV and Pattern Recognition