FLAWS: A Benchmark for Error Identification and Localization in Scientific Papers
By: Sarina Xi, Vishisht Rao, Justin Payan, and more
Potential Business Impact:
Helps computers find mistakes in science papers.
The identification and localization of errors is a core task in peer review, yet the exponential growth of scientific output has made it increasingly difficult for human reviewers to reliably detect errors given the limited pool of experts. Recent advances in Large Language Models (LLMs) have sparked interest in their potential to support such evaluation tasks, from academic peer review to automated scientific assessment. However, despite the growing use of LLMs in review systems, their ability to pinpoint errors remains underexplored. In this work, we introduce Fault Localization Across Writing in Science (FLAWS), an automated benchmark consisting of 713 paper-error pairs designed to evaluate how effectively LLMs detect errors that undermine key claims in research papers. We construct the benchmark by systematically inserting claim-invalidating errors into peer-reviewed papers using LLMs, paired with an automated evaluation metric that measures whether models can identify and localize these errors. Developing such a benchmark presents unique challenges that we overcome: ensuring that the inserted errors are well-defined, challenging, and relevant to the content of the paper; avoiding artifacts that would make identification trivial; and designing a scalable, automated evaluation metric. On the resulting benchmark, we evaluate five frontier LLMs: Claude Sonnet 4.5, DeepSeek Reasoner v3.1, Gemini 2.5 Pro, GPT 5, and Grok 4. Among these, GPT 5 is the top-performing model, achieving 39.1% identification accuracy at k = 10, where k is the number of top-ranked error text candidates generated by the LLM.
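The headline number can be read as a top-k hit rate: a paper-error pair counts as identified if any of the model's k highest-ranked error-text candidates matches the inserted error. The sketch below illustrates that reading in Python; the `Example` structure, the containment-based matching rule, and all names are illustrative assumptions, not the paper's actual automated metric.

```python
from dataclasses import dataclass

@dataclass
class Example:
    gold_error_text: str          # error text inserted into the paper (assumed format)
    ranked_candidates: list[str]  # LLM's error-text candidates, best first

def _normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting differences don't matter.
    return " ".join(text.lower().split())

def candidate_matches(candidate: str, gold: str) -> bool:
    # Illustrative matching rule: one normalized span contains the other.
    # FLAWS defines its own automated metric; this containment check is only a stand-in.
    c, g = _normalize(candidate), _normalize(gold)
    return c in g or g in c

def identification_accuracy_at_k(examples: list[Example], k: int = 10) -> float:
    # A pair is a hit if any of the top-k candidates matches the inserted error.
    hits = sum(
        any(candidate_matches(c, ex.gold_error_text) for c in ex.ranked_candidates[:k])
        for ex in examples
    )
    return hits / len(examples) if examples else 0.0
```

Under this reading, GPT 5's reported 39.1% at k = 10 would correspond to roughly 279 of the 713 paper-error pairs being localized within the top ten candidates.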
Similar Papers
Exploring the Potential and Limitations of Large Language Models for Novice Program Fault Localization
Software Engineering
Helps new coders find mistakes in their programs.
Automatic Reviewers Fail to Detect Faulty Reasoning in Research Papers: A New Counterfactual Evaluation Framework
Computation and Language
AI can't spot bad research logic yet.
Unveiling the Merits and Defects of LLMs in Automatic Review Generation for Scientific Papers
Computation and Language
Helps computers write better science paper reviews.