Test-time Prompt Refinement for Text-to-Image Models
By: Mohammad Abdul Hafeez Khan, Yash Jain, Siddhartha Bhattacharyya, and more
Potential Business Impact:
Fixes AI art mistakes by checking its own work.
Text-to-image (T2I) generation models have made significant strides but still struggle with prompt sensitivity: even minor changes in prompt wording can yield inconsistent or inaccurate outputs. To address this challenge, we introduce TIR, a closed-loop, test-time prompt refinement framework that requires no additional training of the underlying T2I model. In our approach, each generation step is followed by a refinement step, in which a pretrained multimodal large language model (MLLM) analyzes the output image together with the user's prompt. The MLLM detects misalignments (e.g., missing objects, incorrect attributes) and produces a refined, physically grounded prompt for the next round of image generation. By iteratively refining the prompt and verifying alignment between the prompt and the image, TIR corrects errors, mirroring the iterative refinement process of human artists. We demonstrate that this closed-loop strategy improves alignment and visual coherence across multiple benchmark datasets, all while maintaining plug-and-play integration with black-box T2I models.
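The generate-critique-refine loop the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate_image` and `critique_prompt` are hypothetical stand-ins for the black-box T2I model and the pretrained MLLM, here mocked with simple string logic so the control flow is runnable.

```python
def generate_image(prompt):
    # Hypothetical stand-in for a black-box T2I call; returns a mock
    # "image" record whose detected objects are just the prompt's words.
    return {"prompt": prompt, "objects": prompt.split()}

def critique_prompt(prompt, image, target_objects):
    # Hypothetical stand-in for the MLLM refinement step: detect objects
    # missing from the image and append them to produce a refined prompt.
    missing = [obj for obj in target_objects if obj not in image["objects"]]
    if not missing:
        return prompt, True  # prompt and image are aligned; stop refining
    return prompt + " " + " ".join(missing), False

def tir_loop(prompt, target_objects, max_rounds=3):
    # Closed loop: generate, verify alignment, refine the prompt, repeat.
    image = generate_image(prompt)
    for _ in range(max_rounds):
        prompt, aligned = critique_prompt(prompt, image, target_objects)
        if aligned:
            break
        image = generate_image(prompt)
    return prompt, image

refined, image = tir_loop("a red apple", ["red", "apple", "table"])
print(refined)  # the missing object "table" is folded into the prompt
```

Note that the loop never touches the generator's weights, which is what makes the approach plug-and-play for black-box T2I models: only the prompt is updated between rounds.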
Similar Papers
Improving Text-to-Image Generation with Input-Side Inference-Time Scaling
Computation and Language
Makes computer pictures better from simple words.
Iterative Prompt Refinement for Safer Text-to-Image Generation
CV and Pattern Recognition
Makes AI art safer by checking pictures and words.