OmniBridge: Unified Multimodal Understanding, Generation, and Retrieval via Latent Space Alignment

Published: September 23, 2025 | arXiv ID: 2509.19018v1

By: Teng Xiao, Zuchao Li, Lefei Zhang

Potential Business Impact:

Lets computers understand and create with pictures and words.

Business Areas:

Semantic Search, Internet Services

Recent advances in multimodal large language models (LLMs) have led to significant progress in understanding, generation, and retrieval tasks. However, current solutions often treat these tasks in isolation or require training LLMs from scratch, resulting in high computational costs and limited generalization across modalities. In this work, we present OmniBridge, a unified and modular multimodal framework that supports vision-language understanding, generation, and retrieval within a single architecture. OmniBridge adopts a language-centric design that reuses pretrained LLMs and introduces a lightweight bidirectional latent alignment module. To address the challenge of task interference, we propose a two-stage decoupled training strategy: (1) supervised fine-tuning and latent space alignment, which align LLM behavior with multimodal reasoning, and (2) semantic-guided diffusion training, which aligns cross-modal latent spaces via learnable query embeddings. Extensive experiments across a wide range of benchmarks demonstrate that OmniBridge achieves competitive or state-of-the-art performance in all three tasks. Moreover, our results highlight the effectiveness of latent space alignment for unifying multimodal modeling under a shared representation space. Code and models are released at https://github.com/xiao-xt/OmniBridge.
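To make the idea of retrieval in a shared representation space more concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the OmniBridge implementation: the abstract does not specify the pooling mechanism or the similarity function, so single-query softmax attention and cosine similarity are stand-ins, and the names `attend`, `rank`, the toy vectors, and the file names are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def attend(query, tokens):
    """Pool token vectors into one latent with softmax attention from one query.

    Assumption: a single query vector stands in for the "learnable query
    embeddings" named in the abstract; here it is fixed rather than trained.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * t for q, t in zip(query, tok)) / scale for tok in tokens]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(tokens[0])
    return [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(dim)]


def rank(query_latent, gallery):
    """Return gallery keys sorted by descending cosine similarity."""
    return sorted(
        gallery,
        key=lambda name: cosine(query_latent, gallery[name]),
        reverse=True,
    )


# Toy data (hypothetical): two "LLM hidden states" and three "image latents".
query = [1.0, 0.0, 0.0]
tokens = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
gallery = {
    "cat.png": [1.0, 0.1, 0.0],
    "car.png": [0.0, 1.0, 0.0],
    "sky.png": [0.0, 0.0, 1.0],
}

latent = attend(query, tokens)
ranking = rank(latent, gallery)
print(ranking)
```

Once both modalities map into one space, retrieval reduces to a nearest-neighbor lookup, which is why a sketch like this needs no task-specific head.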

HaploOmni: Unified Single Transformer for Multimodal Video Understanding and Generation

CV and Pattern Recognition

Teaches computers to understand and create images and videos.

3 Jun 2025 2

89%

LangBridge: Interpreting Image as a Combination of Language Embeddings

CV and Pattern Recognition

Lets computers understand pictures and words together.

25 Mar 2025 0

89%

Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model

Artificial Intelligence

Computer understands talking, seeing, and writing together.

16 Jun 2025 2

Country of Origin

🇨🇳 China

Repos / Data Links

github.com

Page Count

19 pages
