End-to-End Multi-Modal Diffusion Mamba
By: Chunhao Lu, Qiang Lu, Meichen Dong, and more
Potential Business Impact:
Makes computers understand pictures and words together better.
Current end-to-end multi-modal models use separate encoders and decoders to process input and output information. This separation hinders joint representation learning across modalities. To unify multi-modal processing, we propose a novel architecture called MDM (Multi-modal Diffusion Mamba). MDM uses a Mamba-based multi-step selection diffusion model to progressively generate and refine modality-specific information through a unified variational autoencoder for both encoding and decoding. This approach allows MDM to achieve superior performance on high-dimensional data, particularly when generating high-resolution images and extended text sequences simultaneously. Our evaluations on image generation, image captioning, visual question answering, text comprehension, and reasoning tasks show that MDM significantly outperforms existing end-to-end models (e.g., MonoFormer, LlamaGen, and Chameleon) and competes effectively with SOTA models such as GPT-4V, Gemini Pro, and Mistral. Our results validate MDM's effectiveness in unifying multi-modal processing while maintaining computational efficiency, establishing a new direction for end-to-end multi-modal architectures.
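To make the abstract's architecture concrete, here is a minimal sketch of the general idea it describes: a single shared VAE encodes and decodes a joint token sequence, and a state-space ("Mamba-style") denoiser refines the latent over several diffusion steps. This is not the authors' code; the class names, the GRU stand-in for a selective state-space block, and the toy sampling rule are all assumptions for illustration.

```python
# Conceptual sketch only: shared-VAE encoding/decoding plus an iterative
# state-space denoiser, loosely following the abstract's description of MDM.
import torch
import torch.nn as nn

class UnifiedVAE(nn.Module):
    """One encoder/decoder pair shared by image and text tokens (assumption)."""
    def __init__(self, dim=256, latent=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, 512), nn.GELU(), nn.Linear(512, 2 * latent))
        self.dec = nn.Sequential(nn.Linear(latent, 512), nn.GELU(), nn.Linear(512, dim))

    def encode(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization

    def decode(self, z):
        return self.dec(z)

class SSMDenoiser(nn.Module):
    """Placeholder for a Mamba block: a recurrent scan over the token sequence."""
    def __init__(self, latent=128):
        super().__init__()
        self.rnn = nn.GRU(latent, latent, batch_first=True)  # stand-in for a selective SSM
        self.out = nn.Linear(latent, latent)

    def forward(self, z_t, t):
        h, _ = self.rnn(z_t + t)   # condition on a broadcast timestep embedding
        return self.out(h)         # predicted refinement of the noisy latent

def sample(vae, denoiser, x_cond, steps=10):
    """Progressively refine a joint latent, then decode with the shared VAE."""
    z = torch.randn_like(vae.encode(x_cond))
    for s in reversed(range(steps)):
        t = torch.full_like(z, s / steps)
        z = z - denoiser(z, t) / steps  # toy update rule, not the paper's sampler
    return vae.decode(z)

if __name__ == "__main__":
    vae, denoiser = UnifiedVAE(), SSMDenoiser()
    tokens = torch.randn(2, 16, 256)            # mock joint "image + text" token sequence
    print(sample(vae, denoiser, tokens).shape)  # torch.Size([2, 16, 256])
```

The point of the sketch is the structural claim in the abstract: both modalities pass through the same encoder/decoder, and generation is an iterative multi-step refinement rather than a single autoregressive decode.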
Similar Papers
TransMamba: Fast Universal Architecture Adaption from Transformers to Mamba
CV and Pattern Recognition
Teaches new computer brains using old ones.
SynergyNet: Fusing Generative Priors and State-Space Models for Facial Beauty Prediction
CV and Pattern Recognition
Makes computers judge faces as beautiful or not.
MMMamba: A Versatile Cross-Modal In Context Fusion Framework for Pan-Sharpening and Zero-Shot Image Enhancement
CV and Pattern Recognition
Makes blurry satellite pictures sharp and clear.