MME-SCI: A Comprehensive and Challenging Science Benchmark for Multimodal Large Language Models

Published: August 19, 2025 | arXiv ID: 2508.13938v1

By: Jiacheng Ruan, Dan Jiang, Xian Gao and more

Potential Business Impact:

Tests AI's science smarts in many languages.

Business Areas:

STEM Education, Education, Science and Engineering

Multimodal large language models (MLLMs) have recently advanced significantly across many domains, and the benchmarks used to evaluate them have been refined in step. Benchmarks in the scientific domain have played an important role in assessing the reasoning capabilities of MLLMs. However, existing benchmarks still face three key challenges: 1) insufficient evaluation of models' reasoning abilities in multilingual scenarios; 2) inadequate assessment of MLLMs across a comprehensive range of modalities; 3) a lack of fine-grained annotation of scientific knowledge points. To address these gaps, we propose MME-SCI, a comprehensive and challenging benchmark. We carefully collected 1,019 high-quality question-answer pairs, which are evaluated under 3 distinct evaluation modes. These pairs cover four subjects (mathematics, physics, chemistry, and biology) and support five languages: Chinese, English, French, Spanish, and Japanese. We conducted extensive experiments on 16 open-source models and 4 closed-source models, and the results demonstrate that MME-SCI is challenging across the board for existing MLLMs. For instance, under the Image-only evaluation mode, o4-mini achieved accuracies of only 52.11%, 24.73%, 36.57%, and 29.80% in mathematics, physics, chemistry, and biology, respectively, indicating a significantly higher difficulty level than existing benchmarks. More importantly, using MME-SCI's multilingual and fine-grained knowledge attributes, we analyzed existing models' performance in depth and identified their weaknesses in specific domains. The data and evaluation code are available at https://github.com/JCruan519/MME-SCI.
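The abstract reports accuracy per subject under a single evaluation mode. A minimal sketch of that kind of breakdown follows. It is not the authors' evaluation code: the record fields (`subject`, `mode`, `prediction`, `answer`), the function name, the exact-match scoring, and the `other-mode` label are assumptions made for illustration, and the records are toy data rather than benchmark items.

```python
from collections import defaultdict


def accuracy_by_subject(records, mode):
    """Return {subject: accuracy in percent} for one evaluation mode.

    Hypothetical helper: scoring is plain exact match between the
    predicted and reference answers, which is an assumption.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        if record["mode"] != mode:
            continue  # only score records from the requested mode
        total[record["subject"]] += 1
        if record["prediction"] == record["answer"]:
            correct[record["subject"]] += 1
    return {subject: 100.0 * correct[subject] / total[subject] for subject in total}


# Toy records invented for illustration only (not MME-SCI data).
# "other-mode" is a placeholder label, not a mode name from the paper.
records = [
    {"subject": "mathematics", "mode": "image-only", "prediction": "A", "answer": "A"},
    {"subject": "mathematics", "mode": "image-only", "prediction": "B", "answer": "C"},
    {"subject": "physics", "mode": "image-only", "prediction": "D", "answer": "D"},
    {"subject": "physics", "mode": "other-mode", "prediction": "A", "answer": "B"},
]

scores = accuracy_by_subject(records, "image-only")
for subject, score in sorted(scores.items()):
    print(f"{subject}: {score:.2f}%")
```

Adding a `language` field to each record and grouping on it instead of `subject` would give the analogous per-language breakdown, again as an assumption about how the data might be organized.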

MMSciBench: Benchmarking Language Models on Chinese Multimodal Scientific Problems

Machine Learning (CS)

Tests if computers can solve science problems.

27 Feb 2025 3

91%

MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity

Computation and Language

Tests how well computers "see" and understand pictures.

5 Nov 2025 2

89%

MME-Emotion: A Holistic Evaluation Benchmark for Emotional Intelligence in Multimodal Large Language Models

CV and Pattern Recognition

Tests computers' feelings and why they feel them.

11 Aug 2025 0

Country of Origin

🇨🇳 China

Repos / Data Links

github.com

Page Count

9 pages
