ConfProBench: A Confidence Evaluation Benchmark for MLLM-Based Process Judges
By: Yue Zhou, Yi Chang, Yuan Wu
Potential Business Impact:
Checks whether an AI's confidence in its own reasoning can be trusted.
Reasoning is a critical capability of multimodal large language models (MLLMs) for solving complex multimodal tasks, and judging the correctness of reasoning steps is crucial for improving this capability. Recently, MLLM-based process judges (MPJs) have been widely used to assess the correctness of reasoning steps in multimodal tasks, so evaluating MPJs themselves is important for identifying their limitations and guiding future improvements. However, existing benchmarks for MPJs focus mainly on tasks such as step-correctness classification and reasoning-process search, while overlooking a key question: whether the confidence scores MPJs produce at the step level are reliable. To address this gap, we propose ConfProBench, the first comprehensive benchmark designed to systematically evaluate the reliability of step-level confidence scores generated by MPJs. The benchmark constructs three types of adversarially perturbed reasoning steps (Synonym Substitution, Syntactic Transformation, and Image Perturbation) to test the robustness of MPJ confidence under perturbation. In addition, we introduce three novel evaluation metrics: the Confidence Robustness Score (CRS), the Confidence Sensitivity Score (CSS), and the Confidence Calibration Score (CCS), which measure robustness, sensitivity, and calibration, respectively. We evaluate 14 state-of-the-art MLLMs, spanning both proprietary and open-source models. The experiments reveal limitations in the confidence performance of current MPJs and establish competitive baselines to support future research.
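To make the evaluation axes concrete, the minimal Python sketch below illustrates the kinds of quantities such a benchmark measures: a toy robustness score (how much a judge's step-level confidence shifts under adversarial perturbation) and the standard expected calibration error as a common stand-in for calibration quality. The score definitions, function names, and example numbers are illustrative assumptions; the abstract does not give the actual CRS/CSS/CCS formulas.

```python
# Illustrative sketch only. These scores are simple stand-ins for the kinds
# of quantities ConfProBench measures (robustness, calibration); they are
# NOT the paper's actual CRS/CSS/CCS definitions, which the abstract does
# not spell out.
from statistics import mean


def robustness_score(original_conf: float, perturbed_confs: list[float]) -> float:
    """Toy robustness: 1 minus the mean absolute confidence shift between an
    original step and its adversarially perturbed variants. Near 1 means the
    judge's confidence barely moves under perturbation."""
    return 1.0 - mean(abs(original_conf - c) for c in perturbed_confs)


def expected_calibration_error(confs: list[float], correct: list[bool],
                               n_bins: int = 10) -> float:
    """Standard ECE, a common stand-in for calibration quality: bin step-level
    confidences, then average |accuracy - mean confidence| per bin, weighted
    by bin size. Lower means better calibrated."""
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for c, ok in zip(confs, correct):
        bins[min(int(c * n_bins), n_bins - 1)].append((c, ok))
    total = len(confs)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = mean(c for c, _ in b)
        acc = mean(1.0 if ok else 0.0 for _, ok in b)
        ece += (len(b) / total) * abs(acc - avg_conf)
    return ece


# Hypothetical judge confidences for one reasoning step and its three
# perturbed variants (synonym substitution, syntactic transformation,
# image perturbation).
print(f"toy robustness: {robustness_score(0.91, [0.88, 0.86, 0.62]):.3f}")

# Hypothetical step-level confidences with ground-truth step correctness.
print(f"toy ECE: {expected_calibration_error([0.9, 0.8, 0.7, 0.95, 0.4], [True, True, False, True, False]):.3f}")
```

Measuring robustness as a confidence shift rather than a label flip reflects the benchmark's focus: a judge may keep its correctness verdict under a meaning-preserving perturbation yet still report an unreliable confidence, which step-classification benchmarks would not detect.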
Similar Papers
ProJudge: A Multi-Modal Multi-Discipline Benchmark and Instruction-Tuning Dataset for MLLM-based Process Judges
Artificial Intelligence
Helps AI check its own science work better.
JudgeBoard: Benchmarking and Enhancing Small Language Models for Reasoning Evaluation
Artificial Intelligence
Small AI models can now judge answers better.
Open the Oyster: Empirical Evaluation and Improvement of Code Reasoning Confidence in LLMs
Software Engineering
Makes AI better at knowing when it's right.