SUPERChem: A Multimodal Reasoning Benchmark in Chemistry
By: Zehua Zhao, Zhixian Huang, Junren Li, and more
Potential Business Impact:
Tests if AI can think like a chemist.
Current benchmarks for evaluating the chemical reasoning capabilities of Large Language Models (LLMs) are limited by oversimplified tasks, a lack of process-level evaluation, and misalignment with expert-level chemistry skills. To address these issues, we introduce SUPERChem, a benchmark of 500 expert-curated, reasoning-intensive chemistry problems covering diverse subfields and provided in both multimodal and text-only formats. Original content and an iterative curation pipeline eliminate flawed items and mitigate data contamination. Each problem is paired with an expert-authored solution path, enabling Reasoning Path Fidelity (RPF) scoring, which evaluates reasoning quality beyond final-answer accuracy. Evaluations against a human baseline of 40.3% accuracy show that even the best-performing model, GPT-5 (High), reaches only 38.5%, followed closely by Gemini 2.5 Pro (37.9%) and DeepSeek-V3.1-Think (37.3%). SUPERChem elicits multi-step, multimodal reasoning, reveals model-dependent effects of visual information, and distinguishes high-fidelity reasoners from heuristic ones. By providing a challenging benchmark and a reliable evaluation framework, SUPERChem aims to facilitate the advancement of LLMs toward expert-level chemical intelligence. The benchmark dataset is available at https://huggingface.co/datasets/ZehuaZhao/SUPERChem.
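Since the dataset is hosted on Hugging Face, a minimal sketch of loading it with the Hugging Face datasets library is shown below. The split and field names used here are assumptions for illustration only; consult the dataset card for the actual schema.

# Minimal sketch: load the SUPERChem benchmark and inspect a few problems.
# The repo id is taken from the paper's URL; split and field names are assumed.
from datasets import load_dataset

dataset = load_dataset("ZehuaZhao/SUPERChem")  # returns a DatasetDict keyed by split
split = next(iter(dataset.values()))           # take the first available split

for problem in split.select(range(3)):         # peek at the first three items
    # Truncate long fields (e.g. solution paths) for readable display.
    print({key: str(value)[:80] for key, value in problem.items()})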
Similar Papers
Chem-R: Learning to Reason as a Chemist
Computational Engineering, Finance, and Science
Helps computers discover new chemicals faster.
ChemLabs on ChemO: A Multi-Agent System for Multimodal Reasoning on IChO 2025
Artificial Intelligence
AI solves hard chemistry problems like a champion.
Evaluating Multi-Hop Reasoning in Large Language Models: A Chemistry-Centric Case Study
Computation and Language
Tests if AI can understand complex chemistry ideas.