OpenAI's HealthBench in Action: Evaluating an LLM-Based Medical Assistant on Realistic Clinical Queries
By: Sandhanakrishnan Ravichandran, Shivesh Kumar, Rogerio Corga Da Silva and more
Potential Business Impact:
Helps doctors get better health answers from AI.
Evaluating large language models (LLMs) on their ability to generate high-quality, accurate, situationally aware answers to clinical questions requires going beyond conventional benchmarks to assess how these systems behave in complex, high-stakes clinical scenarios. Traditional evaluations are often limited to multiple-choice questions, which fail to capture essential competencies such as contextual reasoning, awareness, and uncertainty handling. To address these limitations, we evaluate our agentic, RAG-based clinical support assistant, DR.INFO, using HealthBench, a rubric-driven benchmark composed of open-ended, expert-annotated health conversations. On the Hard subset of 1,000 challenging examples, DR.INFO achieves a HealthBench score of 0.51, substantially outperforming leading frontier LLMs (GPT-5, o3, Grok 3, GPT-4, Gemini 2.5, etc.) across all behavioral axes (accuracy, completeness, instruction following, etc.). In a separate 100-sample evaluation against similar agentic RAG assistants (OpenEvidence, Pathway.md), it maintains its performance lead with a HealthBench score of 0.54. These results highlight DR.INFO's strengths in communication, instruction following, and accuracy, while also revealing room for improvement in context awareness and response completeness. Overall, the findings underscore the utility of behavior-level, rubric-based evaluation for building a reliable and trustworthy AI-enabled clinical support assistant.
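To make the idea of a rubric-driven score such as 0.51 more concrete, the following is a minimal sketch of how a rubric-based score could be aggregated. It is an illustration only, not the authors' evaluation code: the `Criterion` class, the function names, and the aggregation rule (points earned on met criteria divided by the maximum achievable positive points, averaged over examples and clipped to [0, 1]) are assumptions made here for illustration.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric item (hypothetical structure, for illustration only)."""
    description: str
    points: int   # positive for desired behavior, negative for undesired
    met: bool     # whether a grader judged the response to meet the item


def example_score(criteria: list[Criterion]) -> float:
    """Points earned on met criteria over the maximum positive points."""
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positively scored criterion")
    earned = sum(c.points for c in criteria if c.met)
    return earned / max_points


def benchmark_score(examples: list[list[Criterion]]) -> float:
    """Mean of per-example scores, clipped to [0, 1] (assumed aggregation)."""
    if not examples:
        raise ValueError("no examples to score")
    mean = sum(example_score(e) for e in examples) / len(examples)
    return min(1.0, max(0.0, mean))


# Toy rubrics with made-up criteria, purely to show the call pattern.
rubric_a = [
    Criterion("Advises urgent care for red-flag symptoms", 5, True),
    Criterion("Asks for missing context before advising", 5, False),
    Criterion("States an unsupported dosage", -2, True),
]
rubric_b = [Criterion("Follows the requested output format", 4, True)]

print(benchmark_score([rubric_a, rubric_b]))
```

Under this assumed scheme, a met negative criterion subtracts from the earned points, so a response can lose credit for harmful content even when it satisfies other rubric items.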
Similar Papers
Generalist Large Language Models Outperform Clinical Tools on Medical Benchmarks
Computation and Language
New AI helps doctors more than old AI.
Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases
Computation and Language
Tests AI doctors' thinking for better patient care.
A Multi-Agent Approach to Neurological Clinical Reasoning
Information Retrieval
AI doctors can now solve hard brain puzzles.