MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech
By: Chengyao Wang, Zhisheng Zhong, Bohao Peng, and more
Potential Business Impact:
Computer talks like you, understands everything.
We present MGM-Omni, a unified Omni LLM for omni-modal understanding and expressive, long-horizon speech generation. Unlike cascaded pipelines that isolate speech synthesis, MGM-Omni adopts a "brain-mouth" design with a dual-track, token-based architecture that cleanly decouples multimodal reasoning from real-time speech generation. This design enables efficient cross-modal interaction and low-latency, streaming speech generation. For understanding, a unified training strategy coupled with a dual audio encoder design enables long-form audio perception across diverse acoustic conditions. For generation, a chunk-based parallel decoding scheme narrows the text-speech token-rate gap, accelerating inference and supporting streaming zero-shot voice cloning with stable timbre over extended durations. Compared to concurrent work, MGM-Omni achieves these capabilities with markedly data-efficient training. Extensive experiments demonstrate that MGM-Omni outperforms existing open-source models in preserving timbre identity across extended sequences, producing natural and context-aware speech, and achieving superior long-form audio and omni-modal understanding. MGM-Omni establishes an efficient, end-to-end paradigm for omni-modal understanding and controllable, personalized long-horizon speech generation.
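To make the chunk-based parallel decoding idea concrete, the sketch below shows one possible way a speech decoder could emit several speech tokens per decoding step, so the high-rate speech token stream keeps pace with the lower-rate text stream. All class names, shapes, and the parallel-head design here are illustrative assumptions; the abstract does not specify MGM-Omni's actual implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of chunk-based parallel decoding (hypothetical names and shapes;
# not the actual MGM-Omni implementation). Each decoding step predicts a whole
# chunk of speech tokens in parallel rather than one token at a time, which is
# one way to narrow the text-speech token-rate gap while staying streamable.

class ChunkSpeechDecoder(nn.Module):
    def __init__(self, hidden_dim=1024, speech_vocab=4096, chunk_size=8):
        super().__init__()
        self.chunk_size = chunk_size
        # One prediction head per position within the chunk (applied in parallel).
        self.heads = nn.ModuleList(
            nn.Linear(hidden_dim, speech_vocab) for _ in range(chunk_size)
        )

    def forward(self, hidden_state):
        # hidden_state: (batch, hidden_dim) context from the speech ("mouth") track.
        # Returns (batch, chunk_size) speech-token ids produced in a single step.
        logits = torch.stack([head(hidden_state) for head in self.heads], dim=1)
        return logits.argmax(dim=-1)

decoder = ChunkSpeechDecoder()
context = torch.randn(1, 1024)   # placeholder context vector for one step
chunk = decoder(context)         # 8 speech tokens from a single decoding step
print(chunk.shape)               # torch.Size([1, 8])
```

In a streaming setup, each chunk of predicted speech tokens could be handed to a vocoder as soon as it is produced, which is what allows low-latency playback while the text-side reasoning continues.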
Similar Papers
Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model
Artificial Intelligence
Computer understands talking, seeing, and writing together.
VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo
Computation and Language
Trains AI to understand all types of information faster.