M$^{3}$T2IBench: A Large-Scale Multi-Category, Multi-Instance, Multi-Relation Text-to-Image Benchmark
By: Huixuan Zhang, Xiaojun Wan
Potential Business Impact:
Makes AI draw pictures that match words better.
Text-to-image models are known to struggle with generating images that perfectly align with textual prompts. Several previous studies have focused on evaluating image-text alignment in text-to-image generation. However, these evaluations either address overly simple scenarios, especially overlooking the difficulty of prompts with multiple different instances belonging to the same category, or they introduce metrics that do not correlate well with human evaluation. In this study, we introduce M$^3$T2IBench, a large-scale, multi-category, multi-instance, multi-relation text-to-image benchmark, along with an object-detection-based evaluation metric, $AlignScore$, which aligns closely with human evaluation. Our findings reveal that current open-source text-to-image models perform poorly on this challenging benchmark. Additionally, we propose the Revise-Then-Enforce approach to enhance image-text alignment. This training-free post-editing method demonstrates improvements in image-text alignment across a broad range of diffusion models. \footnote{Our code and data have been released in the supplementary material and will be made publicly available after the paper is accepted.}
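To make the idea of an object-detection-based alignment metric concrete, here is a minimal sketch of one plausible scoring scheme: run a detector on the generated image, then compare the detected instance counts per category against the counts the prompt requested. The function name and formula below are illustrative assumptions, not the paper's actual definition of $AlignScore$.

```python
from collections import Counter

def align_score(expected, detected):
    """Illustrative alignment score (hypothetical formula): the
    fraction of prompt-specified instances that an object detector
    found in the generated image, matched per category."""
    want = Counter(expected)   # e.g. prompt asks for {"cat": 2, "dog": 1}
    have = Counter(detected)   # categories reported by the detector
    matched = sum(min(n, have[c]) for c, n in want.items())
    total = sum(want.values())
    return matched / total if total else 1.0

# Prompt requested two cats and a dog; the detector found one cat and one dog.
print(align_score(["cat", "cat", "dog"], ["cat", "dog"]))  # 0.666...
```

A real metric would also need to score the relations between instances (e.g. spatial predicates), which this count-matching sketch omits.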
Similar Papers
LongT2IBench: A Benchmark for Evaluating Long Text-to-Image Generation with Graph-structured Annotations
CV and Pattern Recognition
Helps AI understand long text descriptions for images.
OneIG-Bench: Omni-dimensional Nuanced Evaluation for Image Generation
CV and Pattern Recognition
Tests AI that makes pictures from words.
AlignBench: Benchmarking Fine-Grained Image-Text Alignment with Synthetic Image-Caption Pairs
CV and Pattern Recognition
Tests how well computers understand pictures and words.