First, do NOHARM: towards clinically safe large language models
By: David Wu, Fateme Nateghi Haredasht, Saloni Kumar Maharaj, and more
Potential Business Impact:
Tests AI models to prevent dangerous medical advice.
Large language models (LLMs) are routinely used by physicians and patients for medical advice, yet their clinical safety profiles remain poorly characterized. We present NOHARM (Numerous Options Harm Assessment for Risk in Medicine), a benchmark using 100 real primary-care-to-specialist consultation cases to measure harm frequency and severity from LLM-generated medical recommendations. NOHARM covers 10 specialties, with 12,747 expert annotations for 4,249 clinical management options. Across 31 LLMs, severe harm occurs in up to 22.2% (95% CI 21.6-22.8%) of cases, with harms of omission accounting for 76.6% (95% CI 76.4-76.8%) of errors. Safety performance is only moderately correlated (r = 0.61-0.64) with existing AI and medical knowledge benchmarks. The best models outperform generalist physicians on safety (mean difference 9.7%, 95% CI 7.0-12.5%), and a diverse multi-agent approach reduces harm compared to solo models (mean difference 8.0%, 95% CI 4.0-12.1%). Therefore, despite strong performance on existing evaluations, widely used AI models can produce severely harmful medical advice at nontrivial rates, underscoring clinical safety as a distinct performance dimension necessitating explicit measurement.
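To make the reported numbers concrete, here is a minimal sketch of how a per-case severe-harm rate with a percentile-bootstrap 95% CI might be computed, in the spirit of the percentages NOHARM reports (e.g., 22.2%, 95% CI 21.6-22.8%). This is not the authors' code: the per-case labels, function names, and bootstrap settings below are hypothetical illustrations only.

import random

random.seed(0)

# Hypothetical per-case labels: 1 = at least one severely harmful
# recommendation flagged by expert annotators, 0 = none.
# 100 cases, ~22% severe harm, loosely mirroring the abstract's figures.
cases = [1] * 22 + [0] * 78

def severe_harm_rate(labels):
    # Fraction of cases with at least one severe harm.
    return sum(labels) / len(labels)

def bootstrap_ci(labels, stat, n_boot=10_000, alpha=0.05):
    # Percentile bootstrap CI: resample cases with replacement,
    # recompute the statistic, and take the alpha/2 quantiles.
    n = len(labels)
    estimates = sorted(
        stat([random.choice(labels) for _ in range(n)])
        for _ in range(n_boot)
    )
    lo = estimates[int(alpha / 2 * n_boot)]
    hi = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

point = severe_harm_rate(cases)
lo, hi = bootstrap_ci(cases, severe_harm_rate)
print(f"severe harm: {point:.1%} (95% CI {lo:.1%}-{hi:.1%})")

Resampling at the case level (rather than over the 12,747 individual annotations) is one plausible design choice, since harms within a single consultation case are correlated; the paper's actual statistical procedure may differ.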
Similar Papers
Balancing Safety and Helpfulness in Healthcare AI Assistants through Iterative Preference Alignment
Artificial Intelligence
Makes AI doctors safer by catching bad advice.
RxSafeBench: Identifying Medication Safety Issues of Large Language Models in Simulated Consultation
Artificial Intelligence
Tests AI to help doctors give safe medicine.
Guardians and Offenders: A Survey on Harmful Content Generation and Safety Mitigation
Computation and Language
Makes AI safer and less likely to say bad things.