
RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models

Published: March 13, 2025 | arXiv ID: 2503.10406v2

By: Yijing Lin, Mengqi Huang, Shuhan Zhuang and more

Potential Business Impact:

Makes one computer program create many kinds of pictures.

Business Areas:

Image Recognition, Data and Analytics, Software

Unifying diverse image generation tasks within a single framework remains a fundamental challenge in visual generation. While large language models (LLMs) achieve unification through task-agnostic data and generation, existing visual generation models fail to meet these principles. Current approaches either rely on per-task datasets and large-scale training or adapt pre-trained image models with task-specific modifications, which limits their generalizability. In this work, we explore video models as a foundation for unified image generation, leveraging their inherent ability to model temporal correlations. We introduce RealGeneral, a novel framework that reformulates image generation as a conditional frame prediction task, analogous to in-context learning in LLMs. To bridge the gap between video models and condition-image pairs, we propose (1) a Unified Conditional Embedding module for multi-modal alignment and (2) a Unified Stream DiT Block with decoupled adaptive LayerNorm and an attention mask to mitigate cross-modal interference. RealGeneral is effective across multiple important visual generation tasks: for example, it achieves a 14.5% improvement in subject similarity for customized generation and a 10% improvement in image quality for the Canny-to-image task. Project page: https://lyne1.github.io/realgeneral_web/; GitHub link: https://github.com/Lyne1/RealGeneral

VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning

CV and Pattern Recognition

Makes one AI create many kinds of pictures.

10 Apr 2025 0

89%

Image Generation as a Visual Planner for Robotic Manipulation

CV and Pattern Recognition

Lets computers plan robot actions by watching videos.

29 Nov 2025 1

88%

Many-for-Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks

CV and Pattern Recognition

Makes one AI do many video and image jobs.

2 Jun 2025 1


Country of Origin

🇨🇳 China

Repos / Data Links

github.com

Page Count

11 pages
