How Far Are We from Generating Missing Modalities with Foundation Models?
By: Guanzhou Ke, Bo Wang, Guoqing Chao, and more
Potential Business Impact:
Helps computers fill in missing picture or text parts.
Multimodal foundation models have demonstrated impressive capabilities across diverse tasks. However, their potential as plug-and-play solutions for missing modality reconstruction remains underexplored. To bridge this gap, we identify and formalize three potential paradigms for missing modality reconstruction, and perform a comprehensive evaluation across these paradigms, covering 42 model variants in terms of reconstruction accuracy and adaptability to downstream tasks. Our analysis reveals that current foundation models often fall short in two critical aspects: (i) fine-grained semantic extraction from the available modalities, and (ii) robust validation of generated modalities. These limitations lead to suboptimal and, at times, misaligned generations. To address these challenges, we propose an agentic framework tailored for missing modality reconstruction. This framework dynamically formulates modality-aware mining strategies based on the input context, facilitating the extraction of richer and more discriminative semantic features. In addition, we introduce a self-refinement mechanism, which iteratively verifies and enhances the quality of generated modalities through internal feedback. Experimental results show that our method reduces FID for missing image reconstruction by at least 14% and MER for missing text reconstruction by at least 10% compared to baselines. The code is released at: https://github.com/Guanzhou-Ke/AFM2.
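The self-refinement idea in the abstract (generate, verify, and improve using internal feedback) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `self_refine`, the `generate` and `verify` callables, the score threshold, and the round limit are all hypothetical and chosen only to show the general shape of such a loop.

```python
from typing import Callable, Tuple


def self_refine(
    generate: Callable[[str], str],
    verify: Callable[[str, str], Tuple[float, str]],
    context: str,
    threshold: float = 0.8,   # hypothetical acceptance score
    max_rounds: int = 3,      # hypothetical retry budget
) -> Tuple[str, float]:
    """Generate a candidate, score it, and retry with feedback until it passes.

    `generate` maps a prompt to a candidate for the missing modality.
    `verify` maps (context, candidate) to a (score, feedback) pair.
    Both are placeholders for model calls and are assumptions of this sketch.
    """
    prompt = context
    best, best_score = "", float("-inf")
    for _ in range(max_rounds):
        candidate = generate(prompt)
        score, feedback = verify(context, candidate)
        # Keep the best candidate seen so far in case no round passes.
        if score > best_score:
            best, best_score = candidate, score
        if score >= threshold:
            break
        # Feed the verifier's critique back into the next generation round.
        prompt = f"{context}\nFeedback: {feedback}"
    return best, best_score
```

In practice, `generate` and `verify` would wrap calls to a foundation model; here they are left abstract so the loop itself stays easy to read.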
Similar Papers
Filling the Gaps: A Multitask Hybrid Multiscale Generative Framework for Missing Modality in Remote Sensing Semantic Segmentation
CV and Pattern Recognition
Helps computers understand Earth pictures even when data is missing.
FedRecon: Missing Modality Reconstruction in Heterogeneous Distributed Environments
Machine Learning (CS)
Fixes AI learning when data is missing.
Disentangling and Generating Modalities for Recommendation in Missing Modality Scenarios
Information Retrieval
Recommends better even with missing info.