
OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models

Published: March 11, 2025 | arXiv ID: 2503.08686v1

By: Jialv Zou, Bencheng Liao, Qian Zhang and more

Potential Business Impact:

Creates text and pictures faster with less data.

Business Areas:

Autonomous Vehicles, Transportation

Recent advancements in unified multimodal understanding and visual generation (or multimodal generation) models have been hindered by their quadratic computational complexity and dependence on large-scale training data. We present OmniMamba, the first linear-architecture-based multimodal generation model that generates both text and images through a unified next-token prediction paradigm. The model fully leverages Mamba-2's high computational and memory efficiency, extending its capabilities from text generation to multimodal generation. To address the data inefficiency of existing unified models, we propose two key innovations: (1) decoupled vocabularies to guide modality-specific generation, and (2) task-specific LoRA for parameter-efficient adaptation. Furthermore, we introduce a decoupled two-stage training strategy to mitigate data imbalance between two tasks. Equipped with these techniques, OmniMamba achieves competitive performance with JanusFlow while surpassing Show-o across benchmarks, despite being trained on merely 2M image-text pairs, which is 1,000 times fewer than Show-o. Notably, OmniMamba stands out with outstanding inference efficiency, achieving up to a 119.2 times speedup and 63% GPU memory reduction for long-sequence generation compared to Transformer-based counterparts. Code and models are released at https://github.com/hustvl/OmniMamba
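The abstract names task-specific LoRA as one of the two data-efficiency techniques but does not spell out the mechanics. The block below is a minimal sketch of the general idea only, assuming the standard LoRA form (a frozen shared weight plus a low-rank update) with one adapter selected per task. The class name `TaskLoRALinear`, the task labels, the dimensions, and the initialization are all hypothetical and are not taken from the OmniMamba code; the repository linked above remains the authoritative implementation.

```python
import random


def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


class TaskLoRALinear:
    """Hypothetical layer: one shared weight W plus one low-rank adapter per task.

    Output for task t is W x + (alpha / rank) * B_t (A_t x).
    This is an illustrative assumption, not the authors' implementation.
    """

    def __init__(self, d_in, d_out, rank, tasks, alpha=1.0, seed=0):
        rng = random.Random(seed)
        # Shared weight, treated as frozen in this sketch.
        self.W = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        self.scale = alpha / rank
        self.adapters = {}
        for t in tasks:
            # A projects down to `rank` dimensions, B projects back up.
            A = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(rank)]
            # B starts at zero, so every adapter is initially a no-op.
            B = [[0.0] * rank for _ in range(d_out)]
            self.adapters[t] = (A, B)

    def forward(self, x, task):
        A, B = self.adapters[task]  # route by task label
        base = matvec(self.W, x)
        delta = matvec(B, matvec(A, x))
        return [b + self.scale * d for b, d in zip(base, delta)]


layer = TaskLoRALinear(d_in=4, d_out=3, rank=2,
                       tasks=["understanding", "generation"])
x = [1.0, 2.0, 3.0, 4.0]
out_u = layer.forward(x, "understanding")
out_g = layer.forward(x, "generation")

# Simulate a training update that touches only the generation adapter.
layer.adapters["generation"][1][0][0] = 1.0
out_g_updated = layer.forward(x, "generation")
```

In this sketch, updating one task's adapter changes only that task's output while the shared weight and the other adapter stay fixed, which is the property that makes per-task adapters parameter-efficient.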

M4V: Multi-Modal Mamba for Text-to-Video Generation

CV and Pattern Recognition

Makes videos from words much faster.

12 Jun 2025

90%

UniMamba: Unified Spatial-Channel Representation Learning with Group-Efficient Mamba for LiDAR-based 3D Object Detection

CV and Pattern Recognition

Helps self-driving cars see in 3D better.

15 Mar 2025

90%

DA-Mamba: Dialogue-aware selective state-space model for multimodal engagement estimation

Artificial Intelligence

Helps computers understand how people feel in talks.

22 Sep 2025


Country of Origin

🇨🇳 China

Repos / Data Links

github.com

Page Count

14 pages
