
Healthy LLMs? Benchmarking LLM Knowledge of UK Government Public Health Information

Published: May 9, 2025 | arXiv ID: 2505.06046v2

By: Joshua Harris, Fan Grayson, Felix Feldman and more

Potential Business Impact:

Tests if AI knows UK health advice.

Business Areas:

Legal Tech, Professional Services

As Large Language Models (LLMs) become widely accessible, a detailed understanding of their knowledge within specific domains becomes necessary for successful real-world use. This is particularly critical in public health, where failure to retrieve relevant, accurate, and current information could significantly impact UK residents. However, little is currently known about LLM knowledge of UK Government public health information. To address this issue, this paper introduces a new benchmark, PubHealthBench, with over 8000 questions for evaluating LLMs' Multiple Choice Question Answering (MCQA) and free-form responses to public health queries. To create PubHealthBench, we extract free text from 687 current UK government guidance documents and implement an automated pipeline for generating MCQA samples. Assessing 24 LLMs on PubHealthBench, we find the latest private LLMs (GPT-4.5, GPT-4.1 and o1) have a high degree of knowledge, achieving >90% accuracy in the MCQA setup, and outperform humans with cursory search engine use. However, in the free-form setup we see lower performance, with no model scoring >75%. Importantly, we find that in both setups LLMs have higher accuracy on guidance intended for the general public. Therefore, there are promising signs that state-of-the-art (SOTA) LLMs are an increasingly accurate source of public health information, but additional safeguards or tools may still be needed when providing free-form responses on public health topics.
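To make the MCQA accuracy figures above concrete, here is a minimal sketch of how accuracy, overall and split by intended audience, might be computed. It is not the paper's pipeline: the `MCQAItem` class, its field names, the `audience` labels, and both helper functions are hypothetical and exist only for illustration, and the data in the usage note is toy data, not PubHealthBench questions.

```python
from dataclasses import dataclass


@dataclass
class MCQAItem:
    """One hypothetical multiple-choice item; field names are illustrative only."""
    question: str
    options: list
    answer_index: int  # index of the correct option
    audience: str      # assumed label, e.g. "public" or "professional"


def mcqa_accuracy(items, predictions):
    """Fraction of items whose predicted option index matches the answer."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    if not items:
        return 0.0
    correct = sum(item.answer_index == pred
                  for item, pred in zip(items, predictions))
    return correct / len(items)


def accuracy_by_audience(items, predictions):
    """Accuracy broken down by the (assumed) audience label."""
    groups = {}
    for item, pred in zip(items, predictions):
        groups.setdefault(item.audience, []).append((item, pred))
    return {
        audience: mcqa_accuracy([i for i, _ in pairs], [p for _, p in pairs])
        for audience, pairs in groups.items()
    }
```

Given a list of items and a model's chosen option index per item, `mcqa_accuracy(items, predictions)` returns a single fraction, while `accuracy_by_audience(items, predictions)` returns one fraction per audience label, which is the kind of split behind the abstract's comparison of guidance for the general public with other guidance.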

Clinical knowledge in LLMs does not translate to human interactions

Human-Computer Interaction

Helps doctors give better advice by testing AI.

26 Apr 2025

RephQA: Evaluating Readability of Large Language Models in Public Health Question Answering

Computation and Language

Helps doctors explain health problems simply.

19 Sep 2025

Dr. GPT Will See You Now, but Should It? Exploring the Benefits and Harms of Large Language Models in Medical Diagnosis using Crowdsourced Clinical Cases

Computers and Society

AI helps answer everyday health questions accurately.

13 Jun 2025


Repos / Data Links

huggingface.co

Page Count

24 pages
