Score: 1

What Makes a Good Generated Image? Investigating Human and Multimodal LLM Image Preference Alignment

Published: September 16, 2025 | arXiv ID: 2509.12750v1

By: Rishab Parthasarathy, Jasmine Collins, Cory Stephenson

BigTech Affiliations: Databricks, Massachusetts Institute of Technology

Potential Business Impact:

Helps AI systems judge what makes generated images look good to people.

Business Areas:
Visual Search, Internet Services

Automated evaluation of generative text-to-image models remains a challenging problem. Recent works have proposed using multimodal LLMs to judge the quality of images, but these works offer little insight into how multimodal LLMs make use of concepts relevant to humans, such as image style or composition, to generate their overall assessment. In this work, we study which attributes of an image (specifically aesthetics, lack of artifacts, anatomical accuracy, compositional correctness, object adherence, and style) are important for both LLMs and humans when judging image quality. We first curate a dataset of human preferences using synthetically generated image pairs. We then use the inter-task correlation between each pair of image quality attributes to understand which attributes are related in human judgments. Repeating the same analysis with LLMs, we find that the relationships between image quality attributes are much weaker. Finally, we study individual image quality attributes by generating synthetic datasets with a high degree of control over each axis. Humans can easily judge the quality of an image with respect to each of these attributes (e.g., distinguishing high- from low-aesthetic images); however, we find that some attributes, such as anatomical accuracy, are much more difficult for multimodal LLMs to learn to judge. Taken together, these findings reveal interesting differences between how humans and multimodal LLMs perceive images.
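To make the inter-attribute correlation analysis described above concrete, here is a minimal sketch (not the authors' released code) of how pairwise correlations between attribute-level preference judgments could be computed. It assumes preference labels are stored as one column per attribute, one row per image pair, with values in {0, 1} indicating which image in the pair was preferred; the attribute names, data layout, and use of Spearman correlation are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch: inter-attribute correlation over pairwise preferences.
# Assumed data layout: one row per image pair, one {0, 1} column per attribute.
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

# Attribute axes named in the abstract.
ATTRIBUTES = [
    "aesthetics",
    "lack_of_artifacts",
    "anatomical_accuracy",
    "compositional_correctness",
    "object_adherence",
    "style",
]

def inter_attribute_correlations(prefs: pd.DataFrame) -> pd.DataFrame:
    """Spearman correlation between every pair of attribute judgments."""
    n = len(ATTRIBUTES)
    corr = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            rho, _ = spearmanr(prefs[ATTRIBUTES[i]], prefs[ATTRIBUTES[j]])
            corr[i, j] = corr[j, i] = rho
    return pd.DataFrame(corr, index=ATTRIBUTES, columns=ATTRIBUTES)

if __name__ == "__main__":
    # Toy data: 1000 simulated image pairs whose per-attribute judgments
    # loosely share a common preference signal.
    rng = np.random.default_rng(0)
    base = rng.integers(0, 2, size=1000)
    data = {
        attr: np.where(rng.random(1000) < 0.7, base, rng.integers(0, 2, 1000))
        for attr in ATTRIBUTES
    }
    print(inter_attribute_correlations(pd.DataFrame(data)).round(2))
```

Running the same function on human labels and on multimodal-LLM labels would let one compare the two correlation matrices directly, which is the kind of comparison the paper reports: strong inter-attribute relationships for humans, weaker ones for LLMs.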

Country of Origin
🇺🇸 United States

Page Count
27 pages

Category
Computer Science:
Computer Vision and Pattern Recognition