PairUni: Pairwise Training for Unified Multimodal Language Models
By: Jiani Zheng , Zhiyang Teng , Xiangtai Li and more
Potential Business Impact:
Helps AI understand and create text and images.
Unified vision-language models (UVLMs) must perform both understanding and generation within a single architecture, but these tasks rely on heterogeneous data and supervision, making it difficult to balance them during reinforcement learning (RL). We propose PairUni, a unified framework that reorganizes data into understanding-generation (UG) pairs and aligns optimization accordingly. We first use GPT-o3 to augment single-task data, generating captions for understanding samples and question-answer (QA) pairs for generation samples, forming aligned pairs from the same instance. Additionally, for each generation sample, we retrieve a semantically related understanding example to form a retrieved pair, linking different but related data points. These paired structures expose cross-task semantic correspondences and support consistent policy learning. To leverage this structure, we present Pair-GPRO, a pair-aware variant based on Group Relative Policy Optimization. It assigns a similarity score to each pair to modulate the advantage, strengthening learning from well-aligned examples and reducing task interference. We curate a high-quality dataset of 16K UG pairs named PairUG for RL fine-tuning and evaluate PairUni on the powerful Janus-Pro UVLMs. Our approach achieves balanced improvements on various UVLMs, outperforming strong UVLM RL baselines. Codes are available at https://github.com/Haochen-Wang409/PairUni.
Similar Papers
PairUni: Pairwise Training for Unified Multimodal Language Models
Computation and Language
Helps computers understand and create text and images.
ParaUni: Enhance Generation in Unified Multimodal Model with Reinforcement-driven Hierarchical Parallel Information Interaction
CV and Pattern Recognition
Makes AI pictures better by using more computer "eyes."
Unified Reward Model for Multimodal Understanding and Generation
CV and Pattern Recognition
Teaches computers to judge pictures and videos better.