Multimodal Benchmarking and Recommendation of Text-to-Image Generation Models
By: Kapil Wanaskar, Gaytri Jena, Magdalini Eirinaki
Potential Business Impact:
Shows that adding structured details (metadata) to prompts makes AI-generated images more realistic, and helps teams pick the best image generator for a given task.
This work presents an open-source unified benchmarking and evaluation framework for text-to-image generation models, with a particular focus on the impact of metadata-augmented prompts. Leveraging the DeepFashion-MultiModal dataset, we assess generated outputs through a comprehensive set of quantitative metrics, including Weighted Score, CLIP (Contrastive Language-Image Pre-training)-based similarity, LPIPS (Learned Perceptual Image Patch Similarity), FID (Fréchet Inception Distance), and retrieval-based measures, as well as qualitative analysis. Our results demonstrate that structured metadata enrichment significantly enhances visual realism, semantic fidelity, and model robustness across diverse text-to-image architectures. While not a traditional recommender system, our framework enables task-specific recommendations for model selection and prompt design based on evaluation metrics.
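As a rough illustration of how this kind of metric-based evaluation can be scripted, the sketch below computes CLIP prompt-image similarity with the Hugging Face `transformers` library and folds it into a simple weighted aggregate. The `openai/clip-vit-base-patch32` checkpoint, the metric weights, and the FID normalization are illustrative assumptions, not the paper's exact Weighted Score formulation.

```python
# Hedged sketch: CLIP-based prompt-image similarity plus a toy weighted
# aggregate. Checkpoint and weights are illustrative assumptions only.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

_MODEL_ID = "openai/clip-vit-base-patch32"  # assumed checkpoint for illustration
_model = CLIPModel.from_pretrained(_MODEL_ID)
_processor = CLIPProcessor.from_pretrained(_MODEL_ID)


def clip_similarity(prompt: str, image: Image.Image) -> float:
    """Cosine similarity between CLIP text and image embeddings."""
    with torch.no_grad():
        txt = _model.get_text_features(
            **_processor(text=[prompt], return_tensors="pt", padding=True)
        )
        img = _model.get_image_features(
            **_processor(images=image, return_tensors="pt")
        )
    return torch.nn.functional.cosine_similarity(txt, img).item()


def weighted_score(clip_sim: float, lpips_dist: float, fid: float) -> float:
    """Toy aggregate where higher is better. LPIPS and FID are
    lower-is-better, so they enter with negative weights; the weights
    below are placeholders, not the paper's Weighted Score."""
    w_clip, w_lpips, w_fid = 0.6, 0.2, 0.2       # hypothetical weights
    fid_norm = min(fid / 100.0, 1.0)             # crude FID normalization
    return w_clip * clip_sim - w_lpips * lpips_dist - w_fid * fid_norm


if __name__ == "__main__":
    image = Image.open("generated_sample.png")   # hypothetical generated output
    prompt = "a model wearing a red floral summer dress, studio lighting"
    sim = clip_similarity(prompt, image)
    print(f"CLIP similarity: {sim:.3f}")
    # LPIPS and FID would come from e.g. the `lpips` package and torchmetrics;
    # fixed placeholder values keep this sketch self-contained.
    print(f"Weighted score: {weighted_score(sim, lpips_dist=0.35, fid=42.0):.3f}")
```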
Similar Papers
MMIG-Bench: Towards Comprehensive and Explainable Evaluation of Multi-Modal Image Generation Models
CV and Pattern Recognition
Tests how well AI makes pictures from words.
Multi-Modal Language Models as Text-to-Image Model Evaluators
CV and Pattern Recognition
Tests AI art generators better with fewer pictures.
ConceptMix++: Leveling the Playing Field in Text-to-Image Benchmarking via Iterative Prompt Optimization
CV and Pattern Recognition
Makes AI better at drawing what you imagine.