X-Fusion: Introducing New Modality to Frozen Large Language Models
By: Sicheng Mo, Thao Nguyen, Xun Huang, and more
Potential Business Impact:
Lets computers understand and create pictures from words.
We propose X-Fusion, a framework that extends pretrained Large Language Models (LLMs) for multimodal tasks while preserving their language capabilities. X-Fusion employs a dual-tower design with modality-specific weights, keeping the LLM's parameters frozen while integrating vision-specific information for both understanding and generation. Our experiments demonstrate that X-Fusion consistently outperforms alternative architectures on both image-to-text and text-to-image tasks. We find that incorporating understanding-focused data improves generation quality, reducing image data noise enhances overall performance, and feature alignment accelerates convergence for smaller models but has minimal impact on larger ones. Our findings provide valuable insights into building efficient unified multimodal models.
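To make the dual-tower idea concrete, here is a minimal PyTorch sketch of a single layer under simplified assumptions; the module name `DualTowerLayer`, the token routing, and the shared output projection are illustrative choices, not the authors' released implementation. Text tokens pass through frozen language-tower projections, image tokens through parallel trainable vision-tower projections, and self-attention runs over the joint token sequence; normalization and other details are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualTowerLayer(nn.Module):
    """Illustrative dual-tower layer: text tokens use frozen language weights,
    image tokens use trainable vision weights; attention spans the joint sequence."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        # Frozen "language tower" projections (stand-ins for pretrained LLM weights).
        self.txt_qkv = nn.Linear(dim, 3 * dim)
        self.txt_mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        for p in self.txt_qkv.parameters():
            p.requires_grad = False
        for p in self.txt_mlp.parameters():
            p.requires_grad = False
        # Trainable "vision tower" projections with matching shapes.
        self.img_qkv = nn.Linear(dim, 3 * dim)
        self.img_mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Single shared output projection here for brevity (a simplification).
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, is_image: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); is_image: (seq,) bool mask, True for image tokens.
        B, S, D = x.shape
        H = self.n_heads
        # Route each token through its modality's projections, then attend jointly.
        qkv = torch.where(is_image[None, :, None], self.img_qkv(x), self.txt_qkv(x))
        q, k, v = (t.view(B, S, H, D // H).transpose(1, 2) for t in qkv.chunk(3, dim=-1))
        attn = F.scaled_dot_product_attention(q, k, v)  # joint attention over both modalities
        x = x + self.out_proj(attn.transpose(1, 2).reshape(B, S, D))
        mlp = torch.where(is_image[None, :, None], self.img_mlp(x), self.txt_mlp(x))
        return x + mlp
```

In this sketch only the vision-tower and output-projection parameters receive gradients, which mirrors the paper's stated goal of adding image understanding and generation while leaving the pretrained LLM weights untouched.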
Similar Papers
FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
CV and Pattern Recognition
Lets computers understand pictures and words together.
OmniFusion: Simultaneous Multilingual Multimodal Translations via Modular Fusion
Computation and Language
Translates speech and images to text faster.
Fusion to Enhance: Fusion Visual Encoder to Enhance Multimodal Language Model
CV and Pattern Recognition
Helps AI see tiny details better.