Re-Align: Structured Reasoning-guided Alignment for In-Context Image Generation and Editing
By: Runze He, Yiji Cheng, Tiankai Hang, and more
Potential Business Impact:
Makes AI draw pictures exactly as you describe them.
In-context image generation and editing (ICGE) enables users to specify visual concepts through interleaved image-text prompts, demanding precise understanding and faithful execution of user intent. Although recent unified multimodal models exhibit promising understanding capabilities, these strengths often fail to transfer effectively to image generation. We introduce Re-Align, a unified framework that bridges the gap between understanding and generation through structured reasoning-guided alignment. At its core lies the In-Context Chain-of-Thought (IC-CoT), a structured reasoning paradigm that decouples semantic guidance from reference association, providing a clear textual target and mitigating confusion among reference images. Furthermore, Re-Align introduces an effective RL training scheme that leverages a surrogate reward to measure the alignment between the structured reasoning text and the generated image, thereby improving the model's overall performance on ICGE tasks. Extensive experiments verify that Re-Align outperforms competitive methods with comparable model scale and resources on both in-context image generation and editing tasks.
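The abstract describes the RL scheme only at a high level. The sketch below is one plausible reading, not the authors' implementation: a frozen text/image encoder pair (hypothetical here) scores the alignment between the IC-CoT reasoning text and each generated image as a surrogate reward, and rewards are normalized within a group of samples for the same prompt, in the style of group-relative policy optimization. The function names `surrogate_reward` and `group_advantages`, the cosine-similarity reward, and the group normalization are all assumptions for illustration.

```python
# Minimal sketch (not the paper's code): surrogate reward as text-image alignment,
# with group-normalized advantages for an RL update. The embeddings stand in for
# outputs of any frozen text/image encoders (e.g., a CLIP-like model).

import numpy as np

def surrogate_reward(reasoning_text_emb: np.ndarray, image_emb: np.ndarray) -> float:
    """Cosine similarity between the structured reasoning text and the generated image."""
    t = reasoning_text_emb / (np.linalg.norm(reasoning_text_emb) + 1e-8)
    v = image_emb / (np.linalg.norm(image_emb) + 1e-8)
    return float(t @ v)

def group_advantages(rewards: list[float]) -> np.ndarray:
    """Normalize rewards within a group of generations for the same prompt."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Toy usage: four generations for one interleaved prompt, each scored
# against the same IC-CoT reasoning text.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=512)
image_embs = [rng.normal(size=512) for _ in range(4)]
rewards = [surrogate_reward(text_emb, e) for e in image_embs]
print(group_advantages(rewards))  # better-aligned samples receive positive advantage
```

In the full method these advantages would weight a policy-gradient update of the generator; the sketch only illustrates how a reasoning-text-to-image alignment score can serve as the reward signal.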
Similar Papers
ThinkRL-Edit: Thinking in Reinforcement Learning for Reasoning-Centric Image Editing
CV and Pattern Recognition
Makes AI better at changing pictures with smart thinking.
Interleaving Reasoning for Better Text-to-Image Generation
CV and Pattern Recognition
Makes AI draw pictures that match words better.