
Benchmarking Open-Source Large Language Models on Healthcare Text Classification Tasks

Published: March 19, 2025 | arXiv ID: 2503.15169v2

By: Yuting Guo, Abeed Sarker

Potential Business Impact:

Helps computers find health info from text.

Business Areas:

Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

The application of large language models (LLMs) to healthcare information extraction has emerged as a promising approach. This study evaluates the classification performance of five open-source LLMs (GEMMA-3-27B-IT, LLAMA3-70B, LLAMA4-109B, DEEPSEEK-R1-DISTILL-LLAMA-70B, and DEEPSEEK-V3-0324-UD-Q2_K_XL) across six healthcare-related classification tasks involving both social media data (breast cancer, changes in medication regimen, adverse pregnancy outcomes, potential COVID-19 cases) and clinical data (stigma labeling, medication change discussion). We report precision, recall, and F1 scores with 95% confidence intervals for all model-task combinations. Our findings reveal significant performance variability between LLMs, with DeepSeekV3 emerging as the strongest overall performer, achieving the highest F1 scores in four tasks. Notably, models generally performed better on social media tasks than on clinical data tasks, suggesting potential domain-specific challenges. GEMMA-3-27B-IT demonstrated exceptionally high recall despite its smaller parameter count, while LLAMA4-109B underperformed its predecessor LLAMA3-70B, indicating that larger parameter counts do not guarantee improved classification results. We observed distinct precision-recall trade-offs across models, with some favoring sensitivity over specificity and others the reverse. These findings highlight the importance of task-specific model selection for healthcare applications, considering the particular data domain and precision-recall requirements rather than model size alone. As healthcare increasingly integrates AI-driven text classification tools, this benchmarking provides guidance for model selection and implementation while underscoring the need for continued evaluation and domain adaptation of LLMs in healthcare contexts.
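The abstract reports precision, recall, and F1 with 95% confidence intervals but does not say how the intervals were computed. The following is a minimal sketch of one common way to produce such numbers for a binary task, assuming a percentile bootstrap over test examples; the function names, the toy labels, and the choice of bootstrap are illustrative assumptions, not the authors' method.

```python
import random
from typing import List, Tuple


def precision_recall_f1(y_true: List[int], y_pred: List[int]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def bootstrap_f1_ci(
    y_true: List[int],
    y_pred: List[int],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile-bootstrap confidence interval for F1 (assumed method)."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        # Resample example indices with replacement and rescore.
        idx = [rng.randrange(n) for _ in range(n)]
        _, _, f1 = precision_recall_f1([y_true[i] for i in idx], [y_pred[i] for i in idx])
        scores.append(f1)
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Toy labels (hypothetical, not data from the paper).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
ci_low, ci_high = bootstrap_f1_ci(y_true, y_pred)
```

Comparing `precision` and `recall` side by side for each model is what exposes the sensitivity-versus-specificity trade-offs the abstract mentions; a wide interval from `bootstrap_f1_ci` signals that an apparent F1 gap between two models may not be meaningful on a small test set.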

Comparative analysis of privacy-preserving open-source LLMs regarding extraction of diagnostic information from clinical CMR imaging reports

Computers and Society

Helps doctors quickly find heart problems from scans.

29 May 2025

92%

Performance of Large Language Models in Supporting Medical Diagnosis and Treatment

Computation and Language

AI helps doctors diagnose illnesses and plan treatments.

14 Apr 2025

91%

Large Language Models for Healthcare Text Classification: A Systematic Review

Computation and Language

Helps doctors sort patient notes using smart computers.

3 Mar 2025



Page Count

6 pages
