e5-omni: Explicit Cross-modal Alignment for Omni-modal Embeddings

Published: January 7, 2026 | arXiv ID: 2601.03666v1

By: Haonan Chen, Sicheng Gao, Radu Timofte and more

Potential Business Impact:

Lets computers understand text, pictures, and sounds together.

Business Areas:

Semantic Search, Internet Services

Modern information systems often involve different types of items, e.g., a text query, an image, a video clip, or an audio segment. This motivates omni-modal embedding models that map heterogeneous modalities into a shared space for direct comparison. However, most recent omni-modal embeddings still rely heavily on implicit alignment inherited from pretrained vision-language model (VLM) backbones. In practice, this causes three common issues: (i) similarity logits have modality-dependent sharpness, so scores are not on a consistent scale; (ii) in-batch negatives become less effective over time because mixed-modality batches create an imbalanced hardness distribution, so many negatives quickly become trivial and contribute little gradient; and (iii) embeddings across modalities show mismatched first- and second-order statistics, which makes rankings less stable. To tackle these problems, we propose e5-omni, a lightweight explicit alignment recipe that adapts off-the-shelf VLMs into robust omni-modal embedding models. e5-omni combines three simple components: (1) modality-aware temperature calibration to align similarity scales, (2) a controllable negative curriculum with debiasing to focus on confusing negatives while reducing the impact of false negatives, and (3) batch whitening with covariance regularization to better match cross-modal geometry in the shared embedding space. Experiments on MMEB-V2 and AudioCaps show consistent gains over strong bi-modal and omni-modal baselines, and the same recipe also transfers well to other VLM backbones. We release our model checkpoint at https://huggingface.co/Haon-Chen/e5-omni-7B.
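
To make component (1) more concrete, here is a minimal sketch of what modality-aware temperature scaling could look like inside an in-batch contrastive (InfoNCE-style) loss. The abstract does not give the paper's exact formulation, so everything here is an assumption for illustration: the helper names (`cosine`, `calibrated_logits`, `info_nce`), the choice to key the temperature on the candidate's modality, and the temperature values themselves are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def calibrated_logits(query, candidates, modalities, temperatures):
    """Divide each similarity by the temperature assigned to the
    candidate's modality (assumed scheme, not taken from the paper)."""
    return [
        cosine(query, cand) / temperatures[mod]
        for cand, mod in zip(candidates, modalities)
    ]


def info_nce(logits, positive_index):
    """Softmax cross-entropy of the positive against all candidates,
    computed with the log-sum-exp trick for numerical stability."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[positive_index]


# Toy batch: one query and three candidates of different modalities.
query = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
modalities = ["text", "image", "audio"]

# Hypothetical per-modality temperatures; in a trained model these
# would typically be learned or tuned rather than fixed by hand.
temperatures = {"text": 0.05, "image": 0.1, "audio": 0.2}

logits = calibrated_logits(query, candidates, modalities, temperatures)
loss = info_nce(logits, positive_index=0)
```

With a single shared temperature, a modality whose raw similarities are systematically sharper would dominate the softmax. Giving each modality its own temperature is one simple way to bring the logits onto a comparable scale before the loss is computed.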

Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data

Computation and Language

Computer understands and makes text, images, and sound.

16 Nov 2025 2

88%

VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo

Computation and Language

Trains AI to understand many things faster.

4 Aug 2025 3

88%

Country of Origin

🇨🇳 🇩🇪 China, Germany

Repos / Data Links

github.com

Page Count

14 pages
