WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue
By: Zachary Ellis, Jared Joselowitz, Yash Deo, and more
Potential Business Impact:
Makes speech recognition in doctor-patient conversations safer for patients.
As Automatic Speech Recognition (ASR) is increasingly deployed in clinical dialogue, standard evaluations still rely heavily on Word Error Rate (WER). This paper challenges that standard, investigating whether WER or other common metrics correlate with the clinical impact of transcription errors. We establish a gold-standard benchmark by having expert clinicians compare ground-truth utterances to their ASR-generated counterparts, labeling the clinical impact of any discrepancies found in two distinct doctor-patient dialogue datasets. Our analysis reveals that WER and a comprehensive suite of existing metrics correlate poorly with the clinician-assigned risk labels (No, Minimal, or Significant Impact). To bridge this evaluation gap, we introduce an LLM-as-a-Judge, programmatically optimized using GEPA to replicate expert clinical assessment. The optimized judge (Gemini-2.5-Pro) achieves human-comparable performance, obtaining 90% accuracy and a strong Cohen's $κ$ of 0.816. This work provides a validated, automated framework for moving ASR evaluation beyond simple textual fidelity to a necessary, scalable assessment of safety in clinical dialogue.
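To make the two metrics named in the abstract concrete, here is a minimal Python sketch of word-level WER and Cohen's κ. It is not the paper's evaluation code. The function names `wer` and `cohens_kappa` and the example utterances are hypothetical illustrations, and the text normalization (lowercasing, whitespace splitting) is an assumption.

```python
from collections import Counter


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Dynamic-programming Levenshtein distance over words, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    if not ref:
        return float(len(hyp) > 0)
    return prev[-1] / len(ref)


def cohens_kappa(labels_a, labels_b) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    n = len(labels_a)
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical utterances, invented for illustration (not from the paper's data).
# Dropping "not" reverses the instruction but costs only one word error.
harmful = wer("you should not stop taking your warfarin",
              "you should stop taking your warfarin")
# Dropped fillers change nothing clinically but cost three word errors.
benign = wer("um yeah so your blood test came back fine",
             "yes your blood test came back fine")
print(f"harmful error WER = {harmful:.3f}, benign error WER = {benign:.3f}")
```

In this invented pair the clinically dangerous transcript scores a lower WER than the harmless one, which is the kind of mismatch the abstract describes. A κ computed with `cohens_kappa` over judge labels and clinician labels (No, Minimal, or Significant Impact) is the chance-corrected agreement that the reported 0.816 refers to.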
Similar Papers
Bridging the Reality Gap: Efficient Adaptation of ASR systems for Challenging Low-Resource Domains
Computation and Language
Makes doctors' notes understandable by computers.
Improving Named Entity Transcription with Contextual LLM-based Revision
Computation and Language
Fixes computer speech errors for important names.