A Two-Stage System for Layout-Controlled Image Generation using Large Language Models and Diffusion Models
By: Jan-Hendrik Koch, Jonas Krumme, Konrad Gadzicki
Potential Business Impact:
Makes AI draw pictures containing exactly the objects you ask for, arranged plausibly.
Text-to-image diffusion models exhibit remarkable generative capabilities, but lack precise control over object counts and spatial arrangements. This work introduces a two-stage system to address these compositional limitations. The first stage employs a Large Language Model (LLM) to generate a structured layout from a list of objects. The second stage uses a layout-conditioned diffusion model to synthesize a photorealistic image adhering to this layout. We find that task decomposition is critical for LLM-based spatial planning; by simplifying the initial generation to core objects and completing the layout with rule-based insertion, we improve object recall from 57.2% to 99.9% for complex scenes. For image synthesis, we compare two leading conditioning methods: ControlNet and GLIGEN. After domain-specific finetuning on table-setting datasets, we identify a key trade-off: ControlNet preserves text-based stylistic control but suffers from object hallucination, while GLIGEN provides superior layout fidelity at the cost of reduced prompt-based controllability. Our end-to-end system successfully generates images with specified object counts and plausible spatial arrangements, demonstrating the viability of a decoupled approach for compositionally controlled synthesis.
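The abstract describes the pipeline only at a high level; the sketch below shows how the two stages could fit together in practice. It is a minimal illustration, not the authors' code: the JSON layout schema, the call_llm helper, the choice of "core" objects, and the table-setting placement rules are all assumptions, and stage 2 calls the off-the-shelf GLIGEN pipeline from the diffusers library rather than the paper's domain-finetuned checkpoints.

```python
# Minimal sketch of the two-stage idea from the abstract (assumptions noted inline).
import json
from typing import Callable

CANVAS = 512  # assumed working resolution, in pixels

LAYOUT_PROMPT = (
    "Place these objects on a {size}x{size} table-setting canvas. "
    'Respond with JSON: [{{"name": str, "box": [x0, y0, x1, y1]}}].\n'
    "Objects: {objects}"
)

CORE = {"plate", "bowl", "glass"}                                   # assumed "core" object split
SIDE_RULES = {"fork": "left", "knife": "right", "spoon": "right"}   # assumed insertion rules


def beside(box, side, width=40, gap=15):
    """Rule-based insertion: a narrow box to the left or right of an anchor box."""
    x0, y0, x1, y1 = box
    if side == "left":
        return [max(0, x0 - gap - width), y0, x0 - gap, y1]
    return [x1 + gap, y0, min(CANVAS, x1 + gap + width), y1]


def plan_layout(objects: list[str], call_llm: Callable[[str], str]) -> list[dict]:
    """Stage 1: the LLM places only the core objects; rules add the rest.
    The abstract reports this decomposition lifts object recall from 57.2% to 99.9%."""
    core = [o for o in objects if o in CORE]
    extra = [o for o in objects if o not in CORE]

    layout = json.loads(
        call_llm(LAYOUT_PROMPT.format(size=CANVAS, objects=", ".join(core)))
    )

    anchors = [e for e in layout if e["name"] == "plate"] or layout
    for i, name in enumerate(extra):
        anchor = anchors[i % len(anchors)]
        layout.append({"name": name, "box": beside(anchor["box"], SIDE_RULES.get(name, "right"))})
    return layout


def render(layout: list[dict], prompt: str):
    """Stage 2: layout-conditioned synthesis with the stock GLIGEN pipeline
    (the paper's table-setting finetuning is not reproduced here)."""
    import torch
    from diffusers import StableDiffusionGLIGENPipeline

    pipe = StableDiffusionGLIGENPipeline.from_pretrained(
        "masterful/gligen-1-4-generation-text-box", torch_dtype=torch.float16
    ).to("cuda")
    return pipe(
        prompt=prompt,
        gligen_phrases=[e["name"] for e in layout],
        gligen_boxes=[[c / CANVAS for c in e["box"]] for e in layout],  # normalize to [0, 1]
        gligen_scheduled_sampling_beta=1.0,
        num_inference_steps=50,
    ).images[0]
```

A ControlNet-based stage 2 would instead rasterize the boxes into a conditioning image and rely on the text prompt for object identity, which is where the abstract's hallucination-versus-controllability trade-off comes from.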
Similar Papers
Grounding Text-to-Image Diffusion Models for Controlled High-Quality Image Generation
CV and Pattern Recognition
Puts specific objects exactly where you want them.
Compositional Image Synthesis with Inference-Time Scaling
CV and Pattern Recognition
Makes AI pictures match words better.
Consistent Image Layout Editing with Diffusion Models
CV and Pattern Recognition
Changes picture layouts while keeping objects looking real.