Format Matters: The Robustness of Multimodal LLMs in Reviewing Evidence from Tables and Charts

Published: November 13, 2025 | arXiv ID: 2511.10075v1

By: Xanh Ho, Yun-Ang Wu, Sunisth Kumar and more

Potential Business Impact:

Helps computers check science facts from charts.

Business Areas:

A/B Testing, Data and Analytics

With the growing number of submitted scientific papers, there is an increasing demand for systems that can assist reviewers in evaluating research claims. Experimental results are a core component of scientific work, often presented in varying formats such as tables or charts. Understanding how robust current multimodal large language models (multimodal LLMs) are at verifying scientific claims across different evidence formats remains an important and underexplored challenge. In this paper, we design and conduct a series of experiments to assess the ability of multimodal LLMs to verify scientific claims using both tables and charts as evidence. To enable this evaluation, we adapt two existing datasets of scientific papers by incorporating annotations and structures necessary for a multimodal claim verification task. Using this adapted dataset, we evaluate 12 multimodal LLMs and find that current models perform better with table-based evidence while struggling with chart-based evidence. We further conduct human evaluations and observe that humans maintain strong performance across both formats, unlike the models. Our analysis also reveals that smaller multimodal LLMs (under 8B parameters) show weak correlation in performance between table-based and chart-based tasks, indicating limited cross-modal generalization. These findings highlight a critical gap in current models' multimodal reasoning capabilities. We suggest that future multimodal LLMs should place greater emphasis on improving chart understanding to better support scientific claim verification.
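As a rough illustration of the correlation finding in the abstract, the sketch below computes a Pearson correlation between table-based and chart-based scores. The abstract does not say which correlation measure the authors used or how scores were paired, so the `pearson` helper, the per-model pairing, and the variable names are all assumptions, and the numbers are placeholders, not results from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances are left unnormalized; the factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Placeholder scores for illustration only; NOT taken from the paper.
# Each position is one hypothetical model, scored on both evidence formats.
table_scores = [0.60, 0.65, 0.70, 0.75]
chart_scores = [0.50, 0.48, 0.55, 0.51]

r = pearson(table_scores, chart_scores)
print(f"table vs. chart correlation: {r:.2f}")
```

A value near 1 would mean that models that do well on tables also do well on charts; a value near 0 would match the weak relationship the abstract reports for smaller models.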

MMReview: A Multidisciplinary and Multimodal Benchmark for LLM-Based Peer Review Automation

Computation and Language

Tests AI to check science papers better.

19 Aug 2025 0

90%

MMReview: A Multidisciplinary and Multimodal Benchmark for LLM-Based Peer Review Automation

Computation and Language

Helps computers review science papers better.

19 Aug 2025 0

89%

Benchmarking Multimodal LLMs on Recognition and Understanding over Chemical Tables

Artificial Intelligence

Helps computers understand chemistry tables better.

13 Jun 2025 1

Country of Origin

🇯🇵 Japan; 🇫🇷 France; 🇹🇼 Taiwan, Province of China

Repos / Data Links

github.com

Page Count

9 pages
