
Automated Identification of Incidentalomas Requiring Follow-Up: A Multi-Anatomy Evaluation of LLM-Based and Supervised Approaches

Published: December 5, 2025 | arXiv ID: 2512.05537v1

By: Namu Park, Farzad Ahmed, Zhaoyi Sun, and others

Potential Business Impact:

Helps doctors find hidden sicknesses in scans.

Business Areas:

Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Objective: To evaluate large language models (LLMs) against supervised baselines for fine-grained, lesion-level detection of incidentalomas requiring follow-up, addressing the limitations of current document-level classification systems.

Methods: We used a dataset of 400 annotated radiology reports containing 1,623 verified lesion findings. We compared three supervised transformer-based encoders (BioClinicalModernBERT, ModernBERT, Clinical Longformer) against four generative LLM configurations built on Llama 3.1-8B, GPT-4o, and GPT-OSS-20b. We introduced a novel inference strategy that uses lesion-tagged inputs and anatomy-aware prompting to ground model reasoning. Performance was evaluated using class-specific F1-scores.

Results: The anatomy-informed GPT-OSS-20b model achieved the highest performance, with an incidentaloma-positive macro-F1 of 0.79. This surpassed all supervised baselines (maximum macro-F1: 0.70) and closely matched the inter-annotator agreement of 0.76. Explicit anatomical grounding yielded statistically significant performance gains across the GPT-based models (p < 0.05), and a majority-vote ensemble of the top systems further improved the macro-F1 to 0.90. Error analysis showed that anatomy-aware LLMs were better at using context to distinguish actionable findings from benign lesions.

Conclusion: When enhanced with structured lesion tagging and anatomical context, generative LLMs significantly outperform traditional supervised encoders and achieve performance comparable to human experts. This approach offers a reliable, interpretable pathway for automated surveillance of incidental findings in radiology workflows.
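The evaluation described above combines per-lesion predictions from several systems by majority vote and scores them with class-specific F1. The following is a minimal Python sketch of that idea, not the authors' pipeline: it assumes each system emits one label per lesion from a hypothetical binary label set (`follow_up`, `no_follow_up`), uses made-up toy predictions, and takes macro-F1 as the unweighted mean of per-class F1. The paper's exact label scheme, averaging, and tie handling may differ.

```python
from collections import Counter

# Hypothetical binary label set for a lesion-level task; the abstract does not
# give the paper's actual class names.
LABELS = ["follow_up", "no_follow_up"]


def majority_vote(predictions_per_system):
    """Combine per-lesion labels from several systems by simple majority vote.

    predictions_per_system: one list of labels per system, all aligned to the
    same lesion order. On a tie, Counter.most_common returns the label seen
    first, i.e. the vote of the earliest-listed system among the tied labels.
    """
    lengths = {len(preds) for preds in predictions_per_system}
    if len(lengths) != 1:
        raise ValueError("need at least one system, all labelling the same lesions")
    ensemble = []
    for votes in zip(*predictions_per_system):
        label, _count = Counter(votes).most_common(1)[0]
        ensemble.append(label)
    return ensemble


def per_class_f1(gold, pred, labels):
    """Return {label: F1}, treating each label in turn as the positive class."""
    scores = {}
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores[label] = 2 * precision * recall / denom if denom else 0.0
    return scores


def macro_f1(gold, pred, labels):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(gold, pred, labels)
    return sum(scores.values()) / len(labels)


# Toy, made-up labels for five lesions (not data from the paper).
gold = ["follow_up", "no_follow_up", "follow_up", "no_follow_up", "follow_up"]
system_a = ["follow_up", "no_follow_up", "no_follow_up", "no_follow_up", "follow_up"]
system_b = ["follow_up", "follow_up", "follow_up", "no_follow_up", "no_follow_up"]
system_c = ["no_follow_up", "no_follow_up", "follow_up", "no_follow_up", "follow_up"]

ensemble = majority_vote([system_a, system_b, system_c])
for name, pred in [("A", system_a), ("B", system_b), ("C", system_c), ("vote", ensemble)]:
    print(name, per_class_f1(gold, pred, LABELS), round(macro_f1(gold, pred, LABELS), 3))
```

In this toy case each system makes a different single-lesion mistake, so the vote recovers every gold label; real systems with correlated errors gain less. With three voters and two classes a tie cannot occur; with more classes, listing the strongest single system first makes it the tie-breaker.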

Related Papers:

Identifying Imaging Follow-Up in Radiology Reports: A Comparative Analysis of Traditional ML and LLM Approaches

Computation and Language

Helps doctors know if patients need more scans.

14 Nov 2025

Beyond Diagnosis: Evaluating Multimodal LLMs for Pathology Localization in Chest Radiographs

CV and Pattern Recognition

AI can find sickness in X-rays.

22 Sep 2025

A Multi-agent Large Language Model Framework to Automatically Assess Performance of a Clinical AI Triage Tool

Computation and Language

Helps AI tools check brain scans more accurately.

30 Oct 2025

Page Count: 22 pages
