ProImage-Bench: Rubric-Based Evaluation for Professional Image Generation

Published: December 13, 2025 | arXiv ID: 2512.12220v1

By: Minheng Ni, Zhengyuan Yang, Yaowen Zhang and more

Potential Business Impact:

Teaches computers to draw accurate science pictures.

Business Areas:

Image Recognition, Data and Analytics, Software

We study professional image generation, where a model must synthesize information-dense, scientifically precise illustrations from technical descriptions rather than merely produce visually plausible pictures. To quantify progress, we introduce ProImage-Bench, a rubric-based benchmark that targets biology schematics, engineering/patent drawings, and general scientific diagrams. For 654 figures collected from real textbooks and technical reports, we construct detailed image instructions and a hierarchy of rubrics that decompose correctness into 6,076 criteria and 44,131 binary checks. Rubrics are derived from surrounding text and reference figures using large multimodal models (LMMs), and are evaluated by an automated LMM-based judge with a principled penalty scheme that aggregates sub-question outcomes into interpretable criterion scores. We benchmark several representative text-to-image models on ProImage-Bench and find that, despite strong open-domain performance, the best base model reaches only 0.791 rubric accuracy and 0.553 criterion score overall, revealing substantial gaps in fine-grained scientific fidelity. Finally, we show that the same rubrics provide actionable supervision: feeding failed checks back into an editing model for iterative refinement boosts a strong generator from 0.653 to 0.865 in rubric accuracy and from 0.388 to 0.697 in criterion score. ProImage-Bench thus offers both a rigorous diagnostic for professional image generation and a scalable signal for improving specification-faithful scientific illustrations.
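The two metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not specify the penalty scheme, so everything here is an assumption made for illustration: the `Criterion` class, the linear `penalty` parameter, and the helper names are hypothetical and are not taken from the paper. The sketch treats rubric accuracy as the pooled pass rate over binary checks, and treats the criterion score as a per-criterion value that drops with each failed check. It also collects the failed checks, which is the kind of signal the abstract describes feeding back into an editing model.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """Hypothetical container: one rubric criterion and its binary check outcomes."""
    name: str
    checks: list[bool]  # True = the judge marked this check as passed


def rubric_accuracy(criteria):
    """Fraction of all binary checks that passed, pooled across criteria."""
    outcomes = [ok for crit in criteria for ok in crit.checks]
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def criterion_score(crit, penalty=0.5):
    """ASSUMED penalty scheme (not the paper's): start at 1.0 and
    subtract `penalty` for every failed check, floored at 0.0."""
    failed = crit.checks.count(False)
    return max(0.0, 1.0 - penalty * failed)


def mean_criterion_score(criteria, penalty=0.5):
    """Average of the per-criterion scores."""
    if not criteria:
        return 0.0
    return sum(criterion_score(c, penalty) for c in criteria) / len(criteria)


def failed_checks(criteria):
    """(criterion name, check index) pairs that failed; candidates for feedback."""
    return [
        (crit.name, i)
        for crit in criteria
        for i, ok in enumerate(crit.checks)
        if not ok
    ]


# Made-up outcomes for a single generated figure.
example = [
    Criterion("labels", [True, True, False, True]),
    Criterion("topology", [True, True]),
    Criterion("arrows", [False, False, True]),
]

print(rubric_accuracy(example))       # 6 of 9 checks pass
print(mean_criterion_score(example))  # per-criterion scores: 0.5, 1.0, 0.0
print(failed_checks(example))
```

Under this assumed scheme a single failed check costs a criterion more than it costs the pooled pass rate, which is one way a criterion score can sit well below rubric accuracy, as the reported 0.553 and 0.791 do.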

PRBench: Large-Scale Expert Rubrics for Evaluating High-Stakes Professional Reasoning

Computation and Language

Tests AI on real-world law and money problems.

14 Nov 2025 3

88%

UniREditBench: A Unified Reasoning-based Image Editing Benchmark

CV and Pattern Recognition

Makes pictures edit better with more thinking.

3 Nov 2025 0

88%

ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge

Computation and Language

Tests AI on hard professional jobs.

21 Oct 2025 4

Page Count

34 pages
