EarthSE: A Benchmark for Evaluating Earth Scientific Exploration Capability of LLMs

Published: May 22, 2025 | arXiv ID: 2505.17139v3

By: Wanghan Xu, Xiangyu Zhao, Yuhao Zhou and more

Potential Business Impact:

Tests computers on understanding Earth science.

Business Areas:

Space Travel Transportation

Advancements in Large Language Models (LLMs) drive interest in scientific applications, necessitating specialized benchmarks for domains such as Earth science. Existing benchmarks either present a general science focus devoid of Earth science specificity or cover isolated subdomains, lacking holistic evaluation. Furthermore, current benchmarks typically neglect the assessment of LLMs' capabilities in open-ended scientific exploration. In this paper, we present a comprehensive and professional benchmark for the Earth sciences, designed to evaluate the capabilities of LLMs in scientific exploration within this domain, spanning from fundamental to advanced levels. Leveraging a corpus of 100,000 research papers, we first construct two Question Answering (QA) datasets: Earth-Iron, which offers extensive question coverage for broad assessment, and Earth-Silver, which features a higher level of difficulty to evaluate professional depth. These datasets encompass five Earth spheres, 114 disciplines, and 11 task categories, assessing foundational knowledge crucial for scientific exploration. Most notably, we introduce Earth-Gold with new metrics, a dataset comprising open-ended multi-turn dialogues specifically designed to evaluate the advanced capabilities of LLMs in scientific exploration, including methodology induction, limitation analysis, and concept proposal. Extensive experiments reveal limitations in 11 leading LLMs across different domains and tasks, highlighting considerable room for improvement in their scientific exploration capabilities. The benchmark is available at https://huggingface.co/ai-earth.

Toward Open Earth Science as Fast and Accessible as Natural Language

Computational Engineering, Finance, and Science

Lets computers understand and analyze Earth pictures.

21 May 2025 2

AtmosSci-Bench: Evaluating the Recent Advance of Large Language Model for Atmospheric Science

Machine Learning (CS)

Tests AI's weather smarts for better climate predictions.

3 Feb 2025 1

Page Count

23 pages
