WESR: Scaling and Evaluating Word-level Event-Speech Recognition
By: Chenchen Yang , Kexin Huang , Liwei Fan and more
Potential Business Impact:
Lets computers hear and understand sounds like laughing.
Speech conveys not only linguistic information but also rich non-verbal vocal events such as laughing and crying. While semantic transcription is well-studied, the precise localization of non-verbal events remains a critical yet under-explored challenge. Current methods suffer from insufficient task definitions with limited category coverage and ambiguous temporal granularity. They also lack standardized evaluation frameworks, hindering the development of downstream applications. To bridge this gap, we first develop a refined taxonomy of 21 vocal events, with a new categorization into discrete (standalone) versus continuous (mixed with speech) types. Based on the refined taxonomy, we introduce WESR-Bench, an expert-annotated evaluation set (900+ utterances) with a novel position-aware protocol that disentangles ASR errors from event detection, enabling precise localization measurement for both discrete and continuous events. We also build a strong baseline by constructing a 1,700+ hour corpus, and train specialized models, surpassing both open-source audio-language models and commercial APIs while preserving ASR quality. We anticipate that WESR will serve as a foundational resource for future research in modeling rich, real-world auditory scenes.
Similar Papers
A Benchmark of French ASR Systems Based on Error Severity
Computation and Language
Makes voice-to-text understand what you mean.
WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue
Computation and Language
Makes doctor talk machines safer for patients.
Semantic Differentiation in Speech Emotion Recognition: Insights from Descriptive and Expressive Speech Roles
Computation and Language
Helps computers understand your feelings in speech.