Score: 2

Asking the Right Questions: Benchmarking Large Language Models in the Development of Clinical Consultation Templates

Published: August 2, 2025 | arXiv ID: 2508.01159v1

By: Liam G. McCoy, Fateme Nateghi Haredasht, Kanav Chopra, and more

BigTech Affiliations: Stanford University

Potential Business Impact:

Helps physicians produce structured eConsult question templates faster, streamlining information exchange during specialist referrals.

This study evaluates the capacity of large language models (LLMs) to generate structured clinical consultation templates for electronic consultation. Using 145 expert-crafted templates developed and routinely used by Stanford's eConsult team, we assess frontier models -- including o3, GPT-4o, Kimi K2, Claude 4 Sonnet, Llama 3 70B, and Gemini 2.5 Pro -- for their ability to produce clinically coherent, concise, and prioritized clinical question schemas. Through a multi-agent pipeline combining prompt optimization, semantic autograding, and prioritization analysis, we show that while models like o3 achieve high comprehensiveness (up to 92.2%), they consistently generate excessively long templates and fail to correctly prioritize the most clinically important questions under length constraints. Performance varies across specialties, with significant degradation in narrative-driven fields such as psychiatry and pain medicine. Our findings demonstrate that LLMs can enhance structured clinical information exchange between physicians, while highlighting the need for more robust evaluation methods that capture a model's ability to prioritize clinically salient information within the time constraints of real-world physician communication.
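To make the "semantic autograding" idea concrete, below is a minimal, hypothetical sketch of how a comprehensiveness score like the one reported in the abstract could be computed: each expert question is checked for a sufficiently similar counterpart among the model-generated questions. The paper's actual autograder is LLM-based; the lexical similarity function, threshold, and example questions here are invented placeholders used only to keep the sketch self-contained and runnable.

```python
# Hypothetical sketch of a semantic autograding step (not the paper's code).
# Lexical similarity stands in for the LLM-based semantic matching used in
# the actual pipeline; all question strings below are invented examples.
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Rough lexical similarity as a stand-in for semantic matching."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def comprehensiveness(expert_questions: list[str],
                      generated_questions: list[str],
                      threshold: float = 0.6) -> float:
    """Fraction of expert questions matched by some generated question."""
    matched = sum(
        1 for eq in expert_questions
        if any(similarity(eq, gq) >= threshold for gq in generated_questions)
    )
    return matched / len(expert_questions) if expert_questions else 0.0


# Toy example (invented questions, not from the Stanford template set).
expert = ["What is the duration of symptoms?", "Is there any history of trauma?"]
generated = ["What is the duration of the symptoms?", "Any history of trauma?"]
print(f"Comprehensiveness: {comprehensiveness(expert, generated):.1%}")
```

A prioritization check could follow the same pattern, comparing the rank of matched questions in the generated template against the expert ordering under a fixed length budget.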

Country of Origin
🇺🇸 🇨🇦 Canada, United States

Page Count
16 pages

Category
Computer Science:
Computation and Language