HiSciBench: A Hierarchical Multi-disciplinary Benchmark for Scientific Intelligence from Reading to Discovery
By: Yaping Zhang, Qixuan Zhang, Xingquan Zhang, and more
Potential Business Impact:
Tests computers on science from reading to discovery.
The rapid advancement of large language models (LLMs) and multimodal foundation models has sparked growing interest in their potential for scientific research. However, scientific intelligence encompasses a broad spectrum of abilities, from understanding fundamental knowledge to conducting creative discovery, and existing benchmarks remain fragmented: most focus on narrow tasks and fail to reflect the hierarchical, multi-disciplinary nature of real scientific inquiry. We introduce HiSciBench, a hierarchical benchmark designed to evaluate foundation models across five levels that mirror the complete scientific workflow: Scientific Literacy (L1), Literature Parsing (L2), Literature-based Question Answering (L3), Literature Review Generation (L4), and Scientific Discovery (L5). HiSciBench contains 8,735 carefully curated instances spanning six major scientific disciplines, including mathematics, physics, chemistry, biology, geography, and astronomy, and supports multimodal inputs such as text, equations, figures, and tables, as well as cross-lingual evaluation. Unlike prior benchmarks that assess isolated abilities, HiSciBench provides an integrated, dependency-aware framework that enables detailed diagnosis of model capabilities across different stages of scientific reasoning. Comprehensive evaluations of leading models, including GPT-5, DeepSeek-R1, and several multimodal systems, reveal substantial performance gaps: while models achieve up to 69% accuracy on basic literacy tasks, performance declines sharply to 25% on discovery-level challenges. HiSciBench establishes a new standard for evaluating scientific intelligence and offers actionable insights for developing models that are not only more capable but also more reliable. The benchmark will be publicly released to facilitate future research.
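To make the hierarchical, multi-disciplinary structure described in the abstract concrete, here is a minimal sketch of how such a benchmark instance and its level-wise scoring might be represented. The class and field names (SciInstance, per_level_accuracy, etc.) are hypothetical illustrations, not the released HiSciBench format.

```python
from dataclasses import dataclass, field
from collections import defaultdict

# Hypothetical sketch of a HiSciBench-style record; field names are
# illustrative assumptions, not the benchmark's actual schema.
@dataclass
class SciInstance:
    level: str                                  # "L1".."L5", literacy -> discovery
    discipline: str                             # e.g. "physics", "chemistry"
    language: str                               # cross-lingual evaluation, e.g. "en", "zh"
    modalities: list = field(default_factory=list)  # e.g. ["text", "figure", "table"]
    question: str = ""
    reference_answer: str = ""

def per_level_accuracy(instances, predictions):
    """Aggregate exact-match accuracy per hierarchy level (L1..L5)."""
    correct, total = defaultdict(int), defaultdict(int)
    for inst, pred in zip(instances, predictions):
        total[inst.level] += 1
        if pred.strip().lower() == inst.reference_answer.strip().lower():
            correct[inst.level] += 1
    return {lvl: correct[lvl] / total[lvl] for lvl in total}

# Example usage with a single toy instance
data = [SciInstance("L1", "physics", "en", ["text"], "SI unit of force?", "newton")]
print(per_level_accuracy(data, ["Newton"]))  # {'L1': 1.0}
```

Grouping scores by level in this way is what allows the kind of diagnosis the abstract reports, e.g. high accuracy on L1 literacy tasks alongside much lower accuracy on L5 discovery tasks.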
Similar Papers
MMSciBench: Benchmarking Language Models on Chinese Multimodal Scientific Problems
Machine Learning (CS)
Tests if computers can solve science problems.
SciHorizon: Benchmarking AI-for-Science Readiness from Scientific Data to Large Language Models
Machine Learning (CS)
Checks if AI can help science discover new things.
MatSciBench: Benchmarking the Reasoning Ability of Large Language Models in Materials Science
Artificial Intelligence
Tests if AI can solve hard science problems.