
RxnBench: A Multimodal Benchmark for Evaluating Large Language Models on Chemical Reaction Understanding from Scientific Literature

Published: December 29, 2025 | arXiv ID: 2512.23565v1

By: Hanzheng Li, Xi Fang, Yixuan Li, and more

Potential Business Impact:

Helps AI understand chemistry pictures for faster discoveries.

Business Areas:

Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

The integration of Multimodal Large Language Models (MLLMs) into chemistry promises to revolutionize scientific discovery, yet their ability to comprehend the dense, graphical language of reactions within authentic literature remains underexplored. Here, we introduce RxnBench, a multi-tiered benchmark designed to rigorously evaluate MLLMs on chemical reaction understanding from scientific PDFs. RxnBench comprises two tasks: Single-Figure QA (SF-QA), which tests fine-grained visual perception and mechanistic reasoning using 1,525 questions derived from 305 curated reaction schemes, and Full-Document QA (FD-QA), which challenges models to synthesize information from 108 articles, requiring cross-modal integration of text, schemes, and tables. Our evaluation of MLLMs reveals a critical capability gap: while models excel at extracting explicit text, they struggle with deep chemical logic and precise structural recognition. Notably, models with inference-time reasoning significantly outperform standard architectures, yet none achieve 50% accuracy on FD-QA. These findings underscore the urgent need for domain-specific visual encoders and stronger reasoning engines to advance autonomous AI chemists.

QCBench: Evaluating Large Language Models on Domain-Specific Quantitative Chemistry

Artificial Intelligence

Tests if computers can do math for chemistry.

3 Aug 2025

91%

Towards Large-scale Chemical Reaction Image Parsing via a Multimodal Large Language Model

CV and Pattern Recognition

Lets computers understand chemistry pictures for science.

11 Mar 2025

90%

Evaluating Large Language Models on Multimodal Chemistry Olympiad Exams

Computation and Language

Helps AI understand chemistry pictures and words.

17 Dec 2025


Country of Origin

🇺🇸 🇨🇳 United States, China

Page Count

14 pages
