Score: 2

How Far Are We from Generating Missing Modalities with Foundation Models?

Published: June 4, 2025 | arXiv ID: 2506.03530v2

By: Guanzhou Ke, Bo Wang, Guoqing Chao and more

Potential Business Impact:

Helps systems reconstruct missing images or text from the data that is available.

Business Areas:
Autonomous Vehicles, Transportation

Multimodal foundation models have demonstrated impressive capabilities across diverse tasks. However, their potential as plug-and-play solutions for missing modality reconstruction remains underexplored. To bridge this gap, we identify and formalize three potential paradigms for missing modality reconstruction, and perform a comprehensive evaluation across these paradigms, covering 42 model variants in terms of reconstruction accuracy and adaptability to downstream tasks. Our analysis reveals that current foundation models often fall short in two critical aspects: (i) fine-grained semantic extraction from the available modalities, and (ii) robust validation of generated modalities. These limitations lead to suboptimal and, at times, misaligned generations. To address these challenges, we propose an agentic framework tailored for missing modality reconstruction. This framework dynamically formulates modality-aware mining strategies based on the input context, facilitating the extraction of richer and more discriminative semantic features. In addition, we introduce a self-refinement mechanism, which iteratively verifies and enhances the quality of generated modalities through internal feedback. Experimental results show that our method reduces FID for missing image reconstruction by at least 14% and MER for missing text reconstruction by at least 10% compared to baselines. Code is released at: https://github.com/Guanzhou-Ke/AFM2.
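
The self-refinement mechanism described in the abstract, which iteratively verifies and improves generated modalities through internal feedback, can be pictured with a short sketch. The loop below is only an illustration under assumed interfaces: `generate`, `verify`, the prompt wording, and the acceptance threshold are hypothetical placeholders, not the authors' API; their actual implementation is in the AFM2 repository linked above.

```python
# Minimal sketch of an iterative generate-verify-refine loop for missing
# modality reconstruction. All interfaces here are assumptions for
# illustration, not the authors' AFM2 implementation.
from typing import Any, Callable, Dict, Tuple


def reconstruct_missing_modality(
    available: Dict[str, Any],                          # observed modalities, e.g. {"text": "..."}
    generate: Callable[[Dict[str, Any], str], Any],     # foundation-model call: context + prompt -> candidate
    verify: Callable[[Dict[str, Any], Any], Tuple[float, str]],  # returns (quality score, feedback)
    max_rounds: int = 3,
    accept_threshold: float = 0.8,
) -> Any:
    """Generate a candidate for the missing modality, then refine it using
    the verifier's internal feedback until it is accepted or rounds run out."""
    prompt = "Reconstruct the missing modality from the available context."
    best, best_score = None, float("-inf")
    for _ in range(max_rounds):
        candidate = generate(available, prompt)
        score, feedback = verify(available, candidate)
        if score > best_score:
            best, best_score = candidate, score
        if score >= accept_threshold:
            break
        # Fold the verifier's feedback back into the next generation prompt.
        prompt = f"{prompt}\nPrevious attempt was insufficient: {feedback}"
    return best
```

A caller would plug a foundation model into `generate` (for example, a text-to-image model when the image modality is missing) and a scoring model into `verify`; the abstract reports that the full framework, with this kind of feedback loop, improves FID by at least 14% and MER by at least 10% over baselines.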

Country of Origin
πŸ‡ΈπŸ‡¬ πŸ‡¨πŸ‡³ Singapore, China

Repos / Data Links
https://github.com/Guanzhou-Ke/AFM2

Page Count
12 pages

Category
Computer Science:
Multimedia