Score: 1

UniHetero: Could Generation Enhance Understanding for Vision-Language-Model at Large Data Scale?

Published: December 29, 2025 | arXiv ID: 2512.23512v1

By: Fengjiao Chen, Minhao Jing, Weitao Lu and more

BigTech Affiliations: Meituan

Potential Business Impact:

Makes computers understand pictures by drawing them.

Business Areas:
Semantic Web Internet Services

Vision-language large models are moving toward unifying visual understanding and visual generation tasks. However, whether generation can enhance understanding remains under-explored at large data scale. In this work, we analyze a unified model with a concise structure, UniHetero, under large-scale pretraining (>200M samples). Our key observations are: (1) Generation can improve understanding, but only when the model generates semantics, not pixels. (2) Generation reveals a superior data scaling trend and higher data utilization. (3) Autoregression on input embeddings is effective at capturing visual details.

Country of Origin
🇨🇳 China

Page Count
11 pages

Category
Computer Science:
Computation and Language