
LLM4SCREENLIT: Recommendations on Assessing the Performance of Large Language Models for Screening Literature in Systematic Reviews

Published: November 16, 2025 | arXiv ID: 2511.12635v1

By: Lech Madeyski, Barbara Kitchenham, Martin Shepperd

Potential Business Impact:

Shows how to rigorously evaluate AI tools that screen research literature, so that relevant papers are not silently missed.

Business Areas:
Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Context: Large language models (LLMs) are released faster than users' ability to evaluate them rigorously. When LLMs underpin research tasks such as identifying relevant literature for systematic reviews (SRs), robust empirical assessment is essential.

Objective: We identify and discuss key challenges in assessing LLM performance for selecting relevant literature, identify good evaluation practices, and propose recommendations.

Method: Using a recent large-scale study as an example, we identify problems with the use of traditional metrics for assessing the performance of Gen-AI tools for identifying relevant literature in SRs. We analyzed 27 additional papers investigating this issue, extracted their performance metrics, and found both good practices and widespread problems, especially in the use and reporting of performance metrics for SR screening.

Results: Major weaknesses included: (i) failure to use metrics that are robust to imbalanced data and that indicate whether results are better than chance (e.g., reliance on Accuracy); (ii) failure to consider the impact of lost evidence when making claims about workload savings; and (iii) pervasive failure to report the full confusion matrix (or performance metrics from which it can be reconstructed), which is essential for future meta-analyses. On the positive side, we extracted good evaluation practices on which our recommendations for researchers, practitioners, and policymakers are built.

Conclusions: SR screening evaluations should prioritize lost evidence/recall alongside the chance-anchored, cost-sensitive Weighted MCC (WMCC) metric, report complete confusion matrices, treat unclassifiable outputs as referred-back positives for assessment, adopt leakage-aware designs with non-LLM baselines and open artifacts, and ground conclusions in cost-benefit analysis where FNs carry higher penalties than FPs.
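
To make the metric discussion concrete, the minimal Python sketch below computes Accuracy, Recall, and the standard Matthews Correlation Coefficient (MCC) from a confusion matrix for an imbalanced screening scenario. The counts are hypothetical, chosen only to illustrate the abstract's point; the paper's Weighted MCC (WMCC), which additionally weights false negatives (lost evidence) more heavily than false positives, is defined in the paper and not reproduced here.

import math

def screening_metrics(tp, fp, fn, tn):
    # Standard definitions computed from the four confusion-matrix cells.
    # Note: this is the textbook MCC, not the paper's Weighted MCC (WMCC).
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if (tp + fn) else 0.0  # share of relevant papers actually found
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = ((tp * tn) - (fp * fn)) / denom if denom else 0.0  # 0 means no better than chance
    return accuracy, recall, mcc

# Hypothetical SR screening run: 100 relevant papers among 5,000 candidates.
acc, rec, mcc = screening_metrics(tp=60, fp=50, fn=40, tn=4850)
print(f"Accuracy = {acc:.3f}")  # ~0.982: looks excellent despite losing 40% of the evidence
print(f"Recall   = {rec:.3f}")  # 0.600: 40 relevant studies are missed (lost evidence)
print(f"MCC      = {mcc:.3f}")  # ~0.563: the chance-anchored view is far less flattering

Reporting the four cells (TP, FP, FN, TN) directly, as the paper recommends, lets future meta-analyses recompute any of these metrics without guesswork.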

Page Count
21 pages

Category
Computer Science:
Software Engineering