VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning

Published: April 10, 2025 | arXiv ID: 2504.07960v2

By: Zhong-Yu Li, Ruoyi Du, Juncheng Yan and more

Potential Business Impact:

Makes one AI create many kinds of pictures.

Business Areas:

Visual Search, Internet Services

Recent progress in diffusion models has significantly advanced various image generation tasks. However, the current mainstream approach remains focused on building task-specific models, which have limited efficiency when supporting a wide range of different needs. While universal models attempt to address this limitation, they face critical challenges, including generalizable task instruction, appropriate task distributions, and unified architectural design. To tackle these challenges, we propose VisualCloze, a universal image generation framework, which supports a wide range of in-domain tasks, generalization to unseen ones, unseen unification of multiple tasks, and reverse generation. Unlike existing methods that rely on language-based task instruction, leading to task ambiguity and weak generalization, we integrate visual in-context learning, allowing models to identify tasks from visual demonstrations. Meanwhile, the inherent sparsity of visual task distributions hampers the learning of transferable knowledge across tasks. To this end, we introduce Graph200K, a graph-structured dataset that establishes various interrelated tasks, enhancing task density and transferable knowledge. Furthermore, we uncover that our unified image generation formulation shares a consistent objective with image infilling, enabling us to leverage the strong generative priors of pre-trained infilling models without modifying the architectures.
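The link between visual in-context learning and image infilling can be pictured with a small sketch. The abstract does not spell out the layout, so everything here is an assumption for illustration: demonstrations and the query are tiled into one grid canvas, each row holding condition images followed by a target, and the query's missing target cell is marked in a mask that an infilling model would be asked to fill. The function name `build_cloze_canvas` and the 64x64 cell size are hypothetical, and no real infilling model is called.

```python
import numpy as np


def build_cloze_canvas(demos, query, cell=(64, 64)):
    """Tile demonstrations and a query into one canvas plus an infill mask.

    demos: list of rows, each a list of HxWx3 arrays (conditions..., target).
    query: list of HxWx3 condition arrays; its target is the cell to infill.
    Returns (canvas, mask), where mask is True over the region to generate.
    Assumed layout for illustration only, not the paper's confirmed design.
    """
    h, w = cell
    n_cols = len(demos[0])
    # The query row reuses the demo layout, with None standing for the blank.
    rows = list(demos) + [list(query) + [None]]
    canvas = np.zeros((len(rows) * h, n_cols * w, 3), dtype=np.float32)
    mask = np.zeros((len(rows) * h, n_cols * w), dtype=bool)
    for r, row in enumerate(rows):
        if len(row) != n_cols:
            raise ValueError("every row must have the same number of cells")
        for c, img in enumerate(row):
            ys = slice(r * h, (r + 1) * h)
            xs = slice(c * w, (c + 1) * w)
            if img is None:
                mask[ys, xs] = True  # region left for the infilling model
            else:
                if img.shape != (h, w, 3):
                    raise ValueError("each image must match the cell size")
                canvas[ys, xs] = img
    return canvas, mask


# Two demonstrations of (condition, target) and one query condition.
rng = np.random.default_rng(0)
demos = [[rng.random((64, 64, 3), dtype=np.float32) for _ in range(2)]
         for _ in range(2)]
query = [rng.random((64, 64, 3), dtype=np.float32)]
canvas, mask = build_cloze_canvas(demos, query)
```

Under this framing the task is never named in text: the filled rows demonstrate it, and generation reduces to completing the masked cell, which is why a pre-trained infilling model could plausibly be reused without architectural changes.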

UniModel: A Visual-Only Framework for Unified Multimodal Understanding and Generation

CV and Pattern Recognition

Makes computers see and create pictures from words.

21 Nov 2025

89%

VUGEN: Visual Understanding priors for GENeration

CV and Pattern Recognition

Makes computers draw pictures from words.

8 Oct 2025

89%

RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models

CV and Pattern Recognition

Makes one computer program create many kinds of pictures.

13 Mar 2025

Page Count

18 pages
