OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation

Published: January 21, 2026 | arXiv ID: 2601.15369v1

By: Letian Zhang, Sucheng Ren, Yanqing Liu and more

Potential Business Impact:

Teaches computers to see and create pictures.

Business Areas:

Image Recognition, Data and Analytics, Software

This paper presents a family of advanced vision encoders, named OpenVision 3, that learn a single, unified visual representation serving both image understanding and image generation. Our core architecture is simple: we feed VAE-compressed image latents to a ViT encoder and train its output to support two complementary roles. First, the encoder output is passed to the ViT-VAE decoder to reconstruct the original image, encouraging the representation to capture generative structure. Second, the same representation is optimized with contrastive learning and image-captioning objectives, strengthening semantic features. By jointly optimizing reconstruction- and semantics-driven signals in a shared latent space, the encoder learns representations that synergize and generalize well across both regimes. We validate this unified design through extensive downstream evaluations with the encoder frozen. For multimodal understanding, we plug the encoder into the LLaVA-1.5 framework, where it performs comparably to a standard CLIP vision encoder (e.g., 62.4 vs. 62.2 on SeedBench, and 83.7 vs. 82.9 on POPE). For generation, we test it under the RAE framework, where it substantially surpasses the standard CLIP-based encoder (e.g., gFID: 1.89 vs. 2.54 on ImageNet). We hope this work can spur future research on unified modeling.
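
The joint objective described in the abstract (reconstruction plus contrastive and captioning signals on one shared representation) can be illustrated with a minimal, dependency-free sketch. Everything specific here is an assumption made for illustration, not something the abstract states: the use of mean squared error for reconstruction, a CLIP-style symmetric InfoNCE loss with temperature 0.07, token-level cross-entropy for captioning, and the hypothetical weights `w_rec`, `w_con`, and `w_cap`. The function names are likewise invented for this example.

```python
import math


def mse(pred, target):
    """Mean squared error between two equal-length flat vectors (assumed reconstruction loss)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def _normalize(v):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def _log_softmax_at(logits, idx):
    """Numerically stable log-softmax evaluated at a single index."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(l - m) for l in logits))
    return logits[idx] - lse


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: the i-th image is the positive for the i-th text (assumed form)."""
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    n = len(imgs)
    # Cosine similarities scaled by the temperature; rows are images, columns are texts.
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]
    i2t = -sum(_log_softmax_at(sim[k], k) for k in range(n)) / n
    t2i = -sum(_log_softmax_at([sim[r][k] for r in range(n)], k) for k in range(n)) / n
    return 0.5 * (i2t + t2i)


def caption_loss(token_logits, target_ids):
    """Mean token-level cross-entropy over one caption (assumed captioning loss)."""
    return -sum(_log_softmax_at(l, t) for l, t in zip(token_logits, target_ids)) / len(target_ids)


def joint_loss(recon, image, image_embs, text_embs, token_logits, target_ids,
               w_rec=1.0, w_con=1.0, w_cap=1.0):
    """Weighted sum of the three signals; the weights are hypothetical placeholders."""
    return (w_rec * mse(recon, image)
            + w_con * contrastive_loss(image_embs, text_embs)
            + w_cap * caption_loss(token_logits, target_ids))
```

In this sketch all three terms would be computed from the output of the same encoder, so gradients from the reconstruction and semantic losses would both shape one representation, which is the design idea the abstract describes.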

OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning

CV and Pattern Recognition

Makes computers understand pictures faster and cheaper.

1 Sep 2025

91%

OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning

CV and Pattern Recognition

Makes AI understand pictures and words better.

7 May 2025

89%

UniFusion: Vision-Language Model as Unified Encoder in Image Generation

CV and Pattern Recognition

Makes pictures match words better for editing.

14 Oct 2025

Page Count

9 pages
