EBind: a practical approach to space binding
By: Jim Broadbent, Felix Cohen, Frederik Hvilshøj, and more
Potential Business Impact:
Lets AI connect images, video, sound, text, and 3D shapes, and trains in hours on one GPU.
We simplify space binding by focusing on two core components: a single encoder per modality and high-quality data. This enables training state-of-the-art models on a single GPU in a few hours rather than over multiple days. We present EBind, an Easy, data-centric, and parameter-efficient method to Bind the embedding spaces of multiple contrastive models. We demonstrate that a simple 1.8B-parameter image-text-video-audio-3D model can outperform models 4 to 17x its size. The key to achieving this is a carefully curated dataset built from three complementary sources: (i) 6.7M fully automated multimodal quintuples sourced via SOTA retrieval models, (ii) 1M diverse, semi-automated triples annotated by humans as negative, partial, or positive matches, and (iii) 3.4M pre-existing captioned data items. We use 13 different evaluations to demonstrate the value of each data source. Because of limitations in existing benchmarks, we further introduce the first high-quality, consensus-annotated zero-shot classification benchmark between audio and point clouds (PCs). In contrast to related work, we will open-source our code, model weights, and datasets.
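The abstract gives the recipe only at a high level: keep one pretrained encoder per modality and bind their embedding spaces contrastively. Below is a minimal sketch of that general idea, not the paper's actual implementation; the `ProjectionHead` module, `info_nce` loss, dimensions, and temperature are all illustrative assumptions, and the frozen-encoder outputs are stood in by random tensors.

```python
# Minimal sketch of contrastive space binding (not the paper's actual code).
# Assumptions: one frozen pretrained encoder per modality, a small trainable
# projection head per modality, and a symmetric InfoNCE loss that pulls
# paired items together in a shared embedding space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Maps a frozen encoder's output into the shared binding space."""
    def __init__(self, in_dim: int, out_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.proj(x), dim=-1)  # unit-norm embeddings

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: row i of `a` and row i of `b` form the positive pair."""
    logits = a @ b.T / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

# Example: bind a hypothetical audio encoder to an anchor (e.g., text) space.
audio_head = ProjectionHead(in_dim=768)    # trainable
anchor_head = ProjectionHead(in_dim=1024)  # trainable (or frozen/identity)

audio_feats = torch.randn(32, 768)    # stand-in for frozen audio encoder outputs
anchor_feats = torch.randn(32, 1024)  # stand-in for frozen anchor encoder outputs

loss = info_nce(audio_head(audio_feats), anchor_head(anchor_feats))
loss.backward()  # only the small projection heads receive gradients
```

Because only the small projection heads are trained while the large encoders stay frozen, the trainable parameter count stays tiny, which is consistent with the single-GPU, few-hour training budget the abstract describes.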
Similar Papers
RLBind: Adversarial-Invariant Cross-Modal Alignment for Unified Robust Embeddings
Robotics
Makes robots see and hear safely, even if tricked.
Unified Interactive Multimodal Moment Retrieval via Cascaded Embedding-Reranking and Temporal-Aware Score Fusion
CV and Pattern Recognition
Finds specific video moments using smart searching.
MIEB: Massive Image Embedding Benchmark
CV and Pattern Recognition
Tests how well computers understand pictures and words.