EBind: a practical approach to space binding
By: Jim Broadbent, Felix Cohen, Frederik Hvilshøj, and more
Potential Business Impact:
Lets AI connect images, video, sound, text, and 3D shapes, and trains in hours on one GPU.
We simplify space binding by focusing on two core components: a single encoder per modality and high-quality data. This enables training state-of-the-art models on a single GPU in a few hours rather than over multiple days. We present EBind, an Easy, data-centric, and parameter-efficient method to Bind the embedding spaces of multiple contrastive models. We demonstrate that a simple 1.8B-parameter image-text-video-audio-3D model can outperform models 4 to 17x its size. The key to achieving this is a carefully curated dataset built from three complementary sources: (i) 6.7M fully automated multimodal quintuples sourced via SOTA retrieval models, (ii) 1M diverse, semi-automated triples annotated by humans as negative, partial, or positive matches, and (iii) 3.4M pre-existing captioned data items. We use 13 different evaluations to demonstrate the value of each data source. Because of limitations in existing benchmarks, we further introduce the first high-quality, consensus-annotated zero-shot classification benchmark between audio and point clouds (PCs). In contrast to related work, we will open-source our code, model weights, and datasets.
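The abstract gives the recipe only at a high level: keep one pretrained encoder per modality and bind their embedding spaces contrastively. Below is a minimal sketch of that general idea, not the paper's actual implementation; the `ProjectionHead` module, `info_nce` loss, dimensions, and temperature are all illustrative assumptions, and the frozen-encoder outputs are stood in by random tensors.

```python
# Minimal sketch of contrastive space binding (not the paper's actual code).
# Assumptions: one frozen pretrained encoder per modality, a small trainable
# projection head per modality, and a symmetric InfoNCE loss that pulls
# paired items together in a shared embedding space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Maps a frozen encoder's output into the shared binding space."""
    def __init__(self, in_dim: int, out_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.proj(x), dim=-1)  # unit-norm embeddings

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: row i of `a` and row i of `b` form the positive pair."""
    logits = a @ b.T / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

# Example: bind a hypothetical audio encoder to an anchor (e.g., text) space.
audio_head = ProjectionHead(in_dim=768)    # trainable
anchor_head = ProjectionHead(in_dim=1024)  # trainable (or frozen/identity)

audio_feats = torch.randn(32, 768)    # stand-in for frozen audio encoder outputs
anchor_feats = torch.randn(32, 1024)  # stand-in for frozen anchor encoder outputs

loss = info_nce(audio_head(audio_feats), anchor_head(anchor_feats))
loss.backward()  # only the small projection heads receive gradients
```

Because only the small projection heads are trained while the large encoders stay frozen, the trainable parameter count stays tiny, which is consistent with the single-GPU, few-hour training budget the abstract describes.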
Similar Papers
RLBind: Adversarial-Invariant Cross-Modal Alignment for Unified Robust Embeddings
Robotics
Makes robots see and hear safely, even if tricked.
Unified Interactive Multimodal Moment Retrieval via Cascaded Embedding-Reranking and Temporal-Aware Score Fusion
CV and Pattern Recognition
Finds specific video moments using smart searching.
MIEB: Massive Image Embedding Benchmark
CV and Pattern Recognition
Tests how well computers understand pictures and words.