Compositional Image Synthesis with Inference-Time Scaling
By: Minsuk Ji, Sanghyeok Lee, Namhyuk Ahn
Potential Business Impact:
Makes AI pictures match words better.
Despite their impressive realism, modern text-to-image models still struggle with compositionality, often failing to render accurate object counts, attributes, and spatial relations. To address this challenge, we present a training-free framework that combines an object-centric approach with self-refinement to improve layout faithfulness while preserving aesthetic quality. Specifically, we leverage large language models (LLMs) to synthesize explicit layouts from input prompts, and we inject these layouts into the image generation process, where an object-centric vision-language model (VLM) judge iteratively reranks multiple candidates to select the most prompt-aligned outcome. By unifying explicit layout grounding with self-refinement-based inference-time scaling, our framework achieves stronger scene alignment with prompts compared to recent text-to-image models. The code is available at https://github.com/gcl-inha/ReFocus.
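The loop described in the abstract can be pictured as a small orchestration around three components: an LLM layout planner, a layout-conditioned image generator, and a VLM judge. The sketch below is an illustrative reading of that description, not the authors' implementation (see the linked repository for that); the callables `propose_layout`, `generate_with_layout`, and `judge_score`, along with the parameters `num_candidates` and `num_rounds`, are hypothetical placeholders.

```python
def refocus_style_loop(prompt,
                       propose_layout,       # hypothetical: LLM maps prompt (+ feedback) to an explicit layout
                       generate_with_layout, # hypothetical: layout-conditioned text-to-image sampler
                       judge_score,          # hypothetical: object-centric VLM scores prompt/image alignment
                       num_candidates=4,
                       num_rounds=3):
    """Sketch of layout-grounded generation with self-refine-style
    inference-time scaling: sample several candidates per round, keep the
    one the VLM judge ranks highest, and let the judge's feedback steer
    the next layout proposal."""
    layout = propose_layout(prompt)
    best_image, best_score = None, float("-inf")

    for _ in range(num_rounds):
        # Inference-time scaling: draw several layout-grounded candidates.
        candidates = [generate_with_layout(prompt, layout)
                      for _ in range(num_candidates)]

        # Object-centric reranking: score each candidate against the prompt.
        scores = [judge_score(prompt, layout, image) for image in candidates]
        top = max(range(len(candidates)), key=lambda i: scores[i])

        if scores[top] > best_score:
            best_image, best_score = candidates[top], scores[top]

        # Self-refinement: revise the layout using the judge's verdict.
        layout = propose_layout(prompt, feedback={"layout": layout,
                                                  "score": scores[top]})

    return best_image
```

Because the framework is training-free, all of the work happens at inference: the only tunable knobs in a setup like this would be how many candidates are sampled per round and how many refinement rounds are run.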
Similar Papers
Object-level Visual Prompts for Compositional Image Generation
CV and Pattern Recognition
Lets you put specific pictures into new scenes.
Vision-Enhanced Large Language Models for High-Resolution Image Synthesis and Multimodal Data Interpretation
CV and Pattern Recognition
Makes computers create clearer pictures from words.
Improving Text-to-Image Generation with Input-Side Inference-Time Scaling
Computation and Language
Makes computer pictures better from simple words.