Benchmarking Open-Source Large Language Models on Healthcare Text Classification Tasks
By: Yuting Guo, Abeed Sarker
Potential Business Impact:
Helps computers find health info from text.
The application of large language models (LLMs) to healthcare information extraction has emerged as a promising approach. This study evaluates the classification performance of five open-source LLMs: GEMMA-3-27B-IT, LLAMA3-70B, LLAMA4-109B, DEEPSEEK-R1-DISTILL-LLAMA-70B, and DEEPSEEK-V3-0324-UD-Q2_K_XL, across six healthcare-related classification tasks involving both social media data (breast cancer, changes in medication regimen, adverse pregnancy outcomes, potential COVID-19 cases) and clinical data (stigma labeling, medication change discussion). We report precision, recall, and F1 scores with 95% confidence intervals for all model-task combinations. Our findings reveal significant performance variability between LLMs, with DeepSeekV3 emerging as the strongest overall performer, achieving the highest F1 scores in four tasks. Notably, models generally performed better on social media tasks compared to clinical data tasks, suggesting potential domain-specific challenges. GEMMA-3-27B-IT demonstrated exceptionally high recall despite its smaller parameter count, while LLAMA4-109B showed surprisingly underwhelming performance compared to its predecessor LLAMA3-70B, indicating that larger parameter counts do not guarantee improved classification results. We observed distinct precision-recall trade-offs across models, with some favoring sensitivity over specificity and vice versa. These findings highlight the importance of task-specific model selection for healthcare applications, considering the particular data domain and precision-recall requirements rather than model size alone. As healthcare increasingly integrates AI-driven text classification tools, this comprehensive benchmarking provides valuable guidance for model selection and implementation while underscoring the need for continued evaluation and domain adaptation of LLMs in healthcare contexts.
Similar Papers
Comparative analysis of privacy-preserving open-source LLMs regarding extraction of diagnostic information from clinical CMR imaging reports
Computers and Society
Helps doctors quickly find heart problems from scans.
Performance of Large Language Models in Supporting Medical Diagnosis and Treatment
Computation and Language
AI helps doctors diagnose illnesses and plan treatments.
Large Language Models for Healthcare Text Classification: A Systematic Review
Computation and Language
Helps doctors sort patient notes using smart computers.