
OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning

Published: May 7, 2025 | arXiv ID: 2505.04601v1

By: Xianhang Li, Yanqing Liu, Haoqin Tu, and more

Potential Business Impact:

Makes AI understand pictures and words better.

Business Areas:

Image Recognition Data and Analytics, Software

OpenAI's CLIP, released in early 2021, has long been the go-to vision encoder for building multimodal foundation models. Although recent alternatives such as SigLIP have begun to challenge this status quo, to our knowledge none are fully open: their training data remains proprietary and/or their training recipes are not released. This paper fills this gap with OpenVision, a fully open, cost-effective family of vision encoders that match or surpass the performance of OpenAI's CLIP when integrated into multimodal frameworks like LLaVA. OpenVision builds on existing work (e.g., CLIPS for the training framework and Recap-DataComp-1B for the training data) while revealing multiple key insights for enhancing encoder quality and showcasing practical benefits for advancing multimodal models. By releasing vision encoders ranging from 5.9M to 632.1M parameters, OpenVision offers practitioners a flexible trade-off between capacity and efficiency in building multimodal models: larger models deliver enhanced multimodal performance, while smaller versions enable lightweight, edge-ready multimodal deployments.

OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning

CV and Pattern Recognition

Makes computers understand pictures faster and cheaper.

1 Sep 2025 1

91%

OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation

Image and Video Processing

Teaches computers to see and create pictures.

21 Jan 2026 0

86%

VoyagerVision: Investigating the Role of Multi-modal Information for Open-ended Learning Systems

Artificial Intelligence

Lets computers build things by seeing pictures.

29 Jun 2025 0


Repos / Data Links

github.com

Page Count

12 pages
