CorBenchX: Large-Scale Chest X-Ray Error Dataset and Vision-Language Model Benchmark for Report Error Correction
By: Jing Zou, Qingqiu Li, Chenyu Lian and more
Potential Business Impact:
Helps computers find and fix mistakes in X-ray reports.
AI-driven models have shown great promise in detecting errors in radiology reports, yet the field lacks a unified benchmark for rigorously evaluating error detection and subsequent correction. To address this gap, we introduce CorBenchX, a comprehensive suite for automated error detection and correction in chest X-ray reports, designed to advance AI-assisted quality control in clinical practice. We first synthesize a large-scale dataset of 26,326 chest X-ray error reports by prompting DeepSeek-R1 to inject clinically common errors, pairing each corrupted report with its original text, error type, and a human-readable description. Leveraging this dataset, we benchmark both open- and closed-source vision-language models (e.g., InternVL, Qwen-VL, GPT-4o, o4-mini, and Claude-3.7) on error detection and correction under zero-shot prompting. Among these models, o4-mini achieves the best performance, with 50.6% detection accuracy and correction scores of BLEU 0.853, ROUGE 0.924, BERTScore 0.981, SembScore 0.865, and CheXbertF1 0.954. These results remain below clinical-level accuracy, highlighting the challenge of precise report correction. To advance the state of the art, we propose a multi-step reinforcement learning (MSRL) framework that optimizes a multi-objective reward combining format compliance, error-type accuracy, and BLEU similarity. We apply MSRL to QwenVL2.5-7B, the top open-source model in our benchmark, achieving improvements of 38.3% in single-error detection precision and 5.2% in single-error correction over the zero-shot baseline.
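To make the multi-objective reward more concrete, the following is a minimal sketch of how format compliance, error-type accuracy, and BLEU similarity might be combined into a single scalar. It is not the authors' implementation. The tagged output template, the `omission` error-type label used in the usage note, the weights, and the function names `simple_bleu` and `combined_reward` are all assumptions made for illustration, and the BLEU here is a simplified bigram variant rather than a standard library metric.

```python
import math
import re
from collections import Counter

# Hypothetical output template; the abstract does not specify the real format.
OUTPUT_PATTERN = re.compile(
    r"<error_type>(.*?)</error_type>\s*<corrected>(.*?)</corrected>", re.DOTALL
)


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=2):
    """Simplified BLEU: geometric mean of clipped n-gram precisions
    (n = 1..max_n) multiplied by a brevity penalty. No smoothing."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngram_counts(cand, n), ngram_counts(ref, n)
        total = sum(cand_ngrams.values())
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        if total == 0 or overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / max_n)


def combined_reward(output, true_error_type, original_report,
                    weights=(0.2, 0.4, 0.4)):
    """Weighted sum of format, error-type, and BLEU rewards.
    The weights are assumed values, not taken from the paper."""
    match = OUTPUT_PATTERN.search(output)
    if match is None:
        # Output does not follow the template: no reward at all.
        return 0.0
    predicted_type = match.group(1).strip()
    corrected = match.group(2).strip()
    format_reward = 1.0
    type_reward = 1.0 if predicted_type.lower() == true_error_type.lower() else 0.0
    bleu_reward = simple_bleu(corrected, original_report)
    w_format, w_type, w_bleu = weights
    return w_format * format_reward + w_type * type_reward + w_bleu * bleu_reward
```

Under these assumptions, an output that follows the template, names the right error type, and reproduces the original report exactly scores close to 1.0, while an output that ignores the template scores 0.0. For example, `combined_reward("<error_type>omission</error_type><corrected>Heart size is normal.</corrected>", "omission", "Heart size is normal.")` gives the maximum reward.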
Similar Papers
XBench: A Comprehensive Benchmark for Visual-Language Explanations in Chest Radiography
CV and Pattern Recognition
Helps doctors trust AI's medical image guesses.
Generative Large Language Models Trained for Detecting Errors in Radiology Reports
Computation and Language
Finds mistakes in doctors' X-ray reports.
ReXVQA: A Large-scale Visual Question Answering Benchmark for Generalist Chest X-ray Understanding
CV and Pattern Recognition
AI reads X-rays better than doctors.