Harnessing Large Language Models for Precision Querying and Retrieval-Augmented Knowledge Extraction in Clinical Data Science
By: Juan Jose Rubio Jan, Jack Wu, Julia Ive
Potential Business Impact:
Lets computers find health info in records.
This study applies Large Language Models (LLMs) to two foundational Electronic Health Record (EHR) data science tasks: structured data querying through generated code (Python/Pandas) and information extraction from unstructured clinical text via a Retrieval-Augmented Generation (RAG) pipeline. We test how accurately LLMs can interact with large structured datasets for analytics, and how reliably they extract semantically correct information from free-text health records when supported by RAG. To this end, we present a flexible evaluation framework that automatically generates synthetic question-answer pairs tailored to the characteristics of each dataset or task. Experiments were conducted on a curated subset of MIMIC-III (four structured tables and one clinical note type), using a mix of locally hosted and API-based LLMs. Evaluation combined exact-match metrics, semantic similarity, and human judgment. Our findings demonstrate the potential of LLMs to support precise querying and accurate information extraction in clinical workflows.
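The evaluation idea in the abstract, generating question-answer pairs from the data itself and then scoring model answers, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' framework: the toy `ADMISSIONS` table, its column names, and the helper functions are hypothetical; plain Python lists stand in for Pandas to keep the example dependency-free; and token-level F1 is used as a crude stand-in for the semantic similarity metric, which the abstract does not specify.

```python
from collections import Counter

# Toy stand-in for one structured EHR table; rows and column names are invented.
ADMISSIONS = [
    {"subject_id": 1, "admission_type": "EMERGENCY", "los_days": 4},
    {"subject_id": 2, "admission_type": "ELECTIVE", "los_days": 2},
    {"subject_id": 3, "admission_type": "EMERGENCY", "los_days": 7},
]


def make_qa_pairs(rows):
    """Derive synthetic (question, gold answer) pairs from the table itself."""
    pairs = []
    for value in sorted({r["admission_type"] for r in rows}):
        count = sum(1 for r in rows if r["admission_type"] == value)
        pairs.append((f"How many admissions have type {value}?", str(count)))
    return pairs


def normalize(text):
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(text.lower().split())


def exact_match(pred, gold):
    return normalize(pred) == normalize(gold)


def token_f1(pred, gold):
    """Token-overlap F1; a rough proxy, not a true semantic similarity score."""
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def evaluate(pairs, answer_fn):
    """Score a hypothetical answer_fn(question) -> str against the gold answers."""
    em_total, f1_total = 0, 0.0
    for question, gold in pairs:
        pred = answer_fn(question)
        em_total += exact_match(pred, gold)
        f1_total += token_f1(pred, gold)
    n = len(pairs)
    return {"exact_match": em_total / n, "token_f1": f1_total / n}
```

In practice `answer_fn` would wrap an LLM call (for example, one that writes and executes a Pandas query, or one that answers from retrieved note passages); here it is left as a plain callable so the scoring logic can be checked independently of any model.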
Similar Papers
Large Language Models with Temporal Reasoning for Longitudinal Clinical Summarization and Prediction
Computation and Language
Helps doctors quickly understand patient history.
Are LLMs Truly Multilingual? Exploring Zero-Shot Multilingual Capability of LLMs for Information Retrieval: An Italian Healthcare Use Case
Artificial Intelligence
Helps doctors find patient info from notes.
Large Language Models are Powerful Electronic Health Record Encoders
Machine Learning (CS)
Helps doctors predict health problems using plain text.