HaploOmni: Unified Single Transformer for Multimodal Video Understanding and Generation
By: Yicheng Xiao, Lin Song, Rui Yang, and more
Potential Business Impact:
Teaches computers to understand and create images and videos.
With the advancement of language models, unified multimodal understanding and generation have made significant strides, with model architectures evolving from separate components to unified single-model frameworks. This paper explores an efficient training paradigm for building a single transformer for unified multimodal understanding and generation. Specifically, we propose a multimodal warmup strategy that utilizes prior knowledge to extend capabilities. To address cross-modal compatibility challenges, we introduce feature pre-scaling and multimodal AdaLN techniques. Integrating the proposed techniques, we present HaploOmni, a new single multimodal transformer. With limited training cost, HaploOmni achieves competitive performance against advanced unified models across multiple image and video understanding and generation benchmarks. All code will be made public at https://github.com/Tencent/HaploVLM.
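To make the cross-modal compatibility idea concrete, the following is a minimal sketch of what "feature pre-scaling" combined with a "multimodal AdaLN" layer could look like in PyTorch. The abstract only names these techniques, so the class name `MultimodalAdaLN`, the per-modality scale/shift parameterization, and the learnable pre-scale factor below are illustrative assumptions rather than the paper's actual implementation.

```python
import torch
import torch.nn as nn


class MultimodalAdaLN(nn.Module):
    """Hypothetical sketch of modality-aware adaptive layer norm.

    Assumption: each token carries a modality id (e.g., 0 = text, 1 = vision),
    and the layer applies (a) a learnable pre-scaling factor to align feature
    magnitudes across modalities, and (b) modality-specific scale/shift after
    a shared, affine-free layer norm.
    """

    def __init__(self, dim: int, num_modalities: int = 2):
        super().__init__()
        # Shared normalization without its own affine parameters.
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        # Separate scale/shift per modality.
        self.gamma = nn.Parameter(torch.ones(num_modalities, dim))
        self.beta = nn.Parameter(torch.zeros(num_modalities, dim))
        # Feature pre-scaling applied before normalization.
        self.pre_scale = nn.Parameter(torch.ones(num_modalities))

    def forward(self, x: torch.Tensor, modality_ids: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); modality_ids: (batch, seq_len) integer labels.
        scale = self.pre_scale[modality_ids].unsqueeze(-1)       # (B, L, 1)
        x = self.norm(x * scale)
        return x * self.gamma[modality_ids] + self.beta[modality_ids]


# Usage: normalize a mixed sequence of text tokens followed by image tokens.
tokens = torch.randn(2, 16, 64)
modality_ids = torch.cat(
    [torch.zeros(2, 8, dtype=torch.long), torch.ones(2, 8, dtype=torch.long)], dim=1
)
out = MultimodalAdaLN(dim=64)(tokens, modality_ids)
print(out.shape)  # torch.Size([2, 16, 64])
```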
Similar Papers
HaploVL: A Single-Transformer Baseline for Multi-Modal Understanding
Computation and Language
Makes AI understand pictures and words together better.
OmniBridge: Unified Multimodal Understanding, Generation, and Retrieval via Latent Space Alignment
Machine Learning (CS)
Lets computers understand and create with pictures and words.
OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models
CV and Pattern Recognition
Creates text and pictures faster with less data.