UniHetero: Could Generation Enhance Understanding for Vision-Language-Model at Large Data Scale?
By: Fengjiao Chen, Minhao Jing, Weitao Lu and more
Potential Business Impact:
Makes computers understand pictures by drawing them.
Large vision-language models are moving toward unifying visual understanding and visual generation tasks. However, whether generation can enhance understanding remains under-explored at large data scale. In this work, we analyze UniHetero, a unified model with a concise structure, under large-scale pretraining (>200M samples). Our key observations are: (1) Generation can improve understanding, but only if the model generates semantics, not pixels. (2) Generation reveals a superior data scaling trend and higher data utilization. (3) Autoregression on input embeddings is effective at capturing visual details.
Similar Papers
Are Unified Vision-Language Models Necessary: Generalization Across Understanding and Generation
CV and Pattern Recognition
Computers learn to see and write better together.
UniModel: A Visual-Only Framework for Unified Multimodal Understanding and Generation
CV and Pattern Recognition
Makes computers see and create pictures from words.
Can Understanding and Generation Truly Benefit Together -- or Just Coexist?
CV and Pattern Recognition
Makes computers draw pictures from descriptions.