HaploOmni: Unified Single Transformer for Multimodal Video Understanding and Generation
By: Yicheng Xiao, Lin Song, Rui Yang, and more
Potential Business Impact:
Teaches computers to understand and create images and videos.
With the advancement of language models, unified multimodal understanding and generation have made significant strides, with model architectures evolving from separate components to unified single-model frameworks. This paper explores an efficient training paradigm for building a single transformer for unified multimodal understanding and generation. Specifically, we propose a multimodal warmup strategy that utilizes prior knowledge to extend capabilities. To address cross-modal compatibility challenges, we introduce feature pre-scaling and multimodal AdaLN techniques. Integrating the proposed techniques, we present HaploOmni, a new single multimodal transformer. With limited training cost, HaploOmni achieves competitive performance against advanced unified models across multiple image and video understanding and generation benchmarks. All code will be made public at https://github.com/Tencent/HaploVLM.
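To make the cross-modal compatibility idea concrete, the following is a minimal sketch of what "feature pre-scaling" combined with a "multimodal AdaLN" layer could look like in PyTorch. The abstract only names these techniques, so the class name `MultimodalAdaLN`, the per-modality scale/shift parameterization, and the learnable pre-scale factor below are illustrative assumptions rather than the paper's actual implementation.

```python
import torch
import torch.nn as nn


class MultimodalAdaLN(nn.Module):
    """Hypothetical sketch of modality-aware adaptive layer norm.

    Assumption: each token carries a modality id (e.g., 0 = text, 1 = vision),
    and the layer applies (a) a learnable pre-scaling factor to align feature
    magnitudes across modalities, and (b) modality-specific scale/shift after
    a shared, affine-free layer norm.
    """

    def __init__(self, dim: int, num_modalities: int = 2):
        super().__init__()
        # Shared normalization without its own affine parameters.
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        # Separate scale/shift per modality.
        self.gamma = nn.Parameter(torch.ones(num_modalities, dim))
        self.beta = nn.Parameter(torch.zeros(num_modalities, dim))
        # Feature pre-scaling applied before normalization.
        self.pre_scale = nn.Parameter(torch.ones(num_modalities))

    def forward(self, x: torch.Tensor, modality_ids: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); modality_ids: (batch, seq_len) integer labels.
        scale = self.pre_scale[modality_ids].unsqueeze(-1)       # (B, L, 1)
        x = self.norm(x * scale)
        return x * self.gamma[modality_ids] + self.beta[modality_ids]


# Usage: normalize a mixed sequence of text tokens followed by image tokens.
tokens = torch.randn(2, 16, 64)
modality_ids = torch.cat(
    [torch.zeros(2, 8, dtype=torch.long), torch.ones(2, 8, dtype=torch.long)], dim=1
)
out = MultimodalAdaLN(dim=64)(tokens, modality_ids)
print(out.shape)  # torch.Size([2, 16, 64])
```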
Similar Papers
HaploVL: A Single-Transformer Baseline for Multi-Modal Understanding
Computation and Language
Makes AI understand pictures and words together better.
OmniBridge: Unified Multimodal Understanding, Generation, and Retrieval via Latent Space Alignment
Machine Learning (CS)
Lets computers understand and create with pictures and words.
OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models
CV and Pattern Recognition
Creates text and pictures faster with less data.