NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching

Published: October 15, 2025 | arXiv ID: 2510.13721v2

By: Run Luo, Xiaobo Xia, Lu Wang and more

Potential Business Impact:

Lets computers understand and create any mix of text, images, video, sound.

Business Areas:

Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Next-generation multimodal foundation models capable of any-to-any cross-modal generation and multi-turn interaction will serve as core components of artificial general intelligence systems, playing a pivotal role in human-machine interaction. However, most existing multimodal models remain constrained by autoregressive architectures, whose inherent limitations prevent a balanced integration of understanding and generation capabilities. Although hybrid and decoupling strategies have been explored to handle understanding and generation separately within unified frameworks, their redundant, non-integrated designs limit their applicability to broader scenarios, such as cross-modal retrieval. In this work, we introduce NExT-OMNI, an open-source omnimodal foundation model that achieves unified modeling through discrete flow paradigms. By leveraging metric-induced probability paths and kinetic optimal velocities, NExT-OMNI natively supports any-to-any understanding and generation with enhanced response efficiency, while enabling broader application scenarios through concise unified representations rather than task-decoupled designs. Trained on large-scale interleaved text, image, video, and audio data, NExT-OMNI delivers competitive performance on multimodal generation and understanding benchmarks, while outperforming prior unified models in multi-turn multimodal interaction and cross-modal retrieval, highlighting its architectural advantages as a next-generation multimodal foundation model. To advance further research, we release training details and data protocols, and open-source both the code and model checkpoints.
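The abstract names metric-induced probability paths and kinetic optimal velocities without defining them. The sketch below is an illustration only, not the NExT-OMNI implementation. It assumes one common form from the discrete flow matching literature: a path p_t(x | x1) = softmax(-beta_t * d(x, x1)) over a token vocabulary, with jump rates u(x, z) = p_t(x) * beta_dot * max(d(z) - d(x), 0). The toy vocabulary, the squared-Euclidean distance, and the names `metric_path` and `velocity` are all hypothetical choices made for this example.

```python
import numpy as np

def metric_path(dist_to_target, beta):
    """Assumed path form: p(x | x1) = softmax(-beta * d(x, x1)) over the vocabulary."""
    logits = -beta * dist_to_target
    logits = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def velocity(dist_to_target, beta, beta_dot):
    """Assumed jump rates u[x, z] for moving from state z to state x.

    u[x, z] = p(x) * beta_dot * max(d(z) - d(x), 0), so probability mass
    only flows toward states that are closer to the target.
    """
    p = metric_path(dist_to_target, beta)
    gap = dist_to_target[None, :] - dist_to_target[:, None]  # gap[x, z] = d(z) - d(x)
    return p[:, None] * beta_dot * np.maximum(gap, 0.0)

# Hypothetical toy setup: 5 tokens with 2-D embeddings, token 3 is the target x1.
rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 2))
target = 3
dist = np.sum((emb - emb[target]) ** 2, axis=1)  # squared Euclidean distance to x1

p_early = metric_path(dist, beta=0.0)   # beta = 0 gives the uniform distribution
p_late = metric_path(dist, beta=50.0)   # large beta concentrates mass on the target

# Check the rates against the path: with beta_dot = 1, net inflow should match d p / d beta.
beta, h = 1.0, 1e-5
p = metric_path(dist, beta)
u = velocity(dist, beta, beta_dot=1.0)
net_flow = u @ p - u.sum(axis=0) * p    # inflow to x minus outflow from x
finite_diff = (metric_path(dist, beta + h) - metric_path(dist, beta - h)) / (2 * h)
```

Under these assumptions, `net_flow` and `finite_diff` agree up to discretization error, which is what it means for the rates to generate the path. Whether NExT-OMNI uses this exact schedule or distance is not stated in the abstract.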

NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching

Computation and Language

Lets computers understand and create text, images, and sound.

15 Oct 2025 1

89%

Agent-Omni: Test-Time Multimodal Reasoning via Model Coordination for Understanding Anything

Artificial Intelligence

Lets computers understand all kinds of information together.

4 Nov 2025 1

89%

Page Count

36 pages
