One Whisper to Grade Them All
By: Nhan Phan, Anusha Porwal, Yaroslav Getman, and more
Potential Business Impact:
Helps computers grade spoken language tests better.
We present an efficient end-to-end approach for holistic Automatic Speaking Assessment (ASA) of multi-part second-language tests, developed for the 2025 Speak & Improve Challenge. Our system's main novelty is that it processes all four spoken responses with a single Whisper-small encoder, combines the resulting information via a lightweight aggregator, and predicts the final score. This architecture removes the need for transcription and per-part models, cuts inference time, and makes ASA practical for large-scale Computer-Assisted Language Learning systems. Our system achieved a Root Mean Squared Error (RMSE) of 0.384, outperforming the text-based baseline (0.44) while using at most 168M parameters (about 70% of Whisper-small). Furthermore, we propose a data sampling strategy that allows the model to train on only 44.8% of the speakers in the corpus and still reach 0.383 RMSE, demonstrating improved performance on imbalanced classes and strong data efficiency.
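To make the architecture concrete, here is a minimal PyTorch sketch of the general idea: one shared Whisper-small encoder embeds each of the four spoken responses, and a small aggregator maps the pooled embeddings to a single holistic score. The class name, mean-pooling choice, and aggregator layout are our own assumptions for illustration, not the authors' published implementation.

```python
import torch
import torch.nn as nn
from transformers import WhisperModel


class HolisticASAScorer(nn.Module):
    """Sketch: one shared Whisper-small encoder scores all four test parts."""

    def __init__(self, num_parts: int = 4, hidden_dim: int = 768):
        super().__init__()
        # Shared encoder: the same weights process every spoken response,
        # so no per-part models (and no transcription step) are needed.
        self.encoder = WhisperModel.from_pretrained("openai/whisper-small").encoder
        # Lightweight aggregator (hypothetical design): concatenate the
        # mean-pooled part embeddings and regress a single holistic score.
        self.aggregator = nn.Sequential(
            nn.Linear(num_parts * hidden_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, part_features: list[torch.Tensor]) -> torch.Tensor:
        # part_features: one log-mel spectrogram batch per test part,
        # each of shape (batch, 80, 3000) as Whisper-small expects.
        pooled = [
            self.encoder(feats).last_hidden_state.mean(dim=1)  # (batch, 768)
            for feats in part_features
        ]
        return self.aggregator(torch.cat(pooled, dim=-1)).squeeze(-1)
```

Sharing one encoder across all parts is what keeps the parameter count near that of a single Whisper-small model rather than four.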
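The abstract does not detail the sampling strategy itself, so the following is a purely illustrative sketch of one way speaker subsampling could address class imbalance: group speakers by holistic score band and cap each band, so over-represented mid-range scores stop dominating training. The banding granularity and per-band cap here are hypothetical.

```python
import random
from collections import defaultdict


def balanced_speaker_sample(speaker_scores: dict[str, float],
                            cap_per_band: int,
                            seed: int = 0) -> list[str]:
    """Hypothetical score-balanced subsampling of training speakers."""
    bands = defaultdict(list)
    for speaker, score in speaker_scores.items():
        bands[round(score * 2) / 2].append(speaker)  # half-point score bands
    rng = random.Random(seed)
    kept = []
    for band_speakers in bands.values():
        rng.shuffle(band_speakers)
        kept.extend(band_speakers[:cap_per_band])  # keep at most the cap
    return kept
```

With a suitable cap, a scheme like this could retain well under half of the corpus's speakers while flattening the score distribution, consistent with the reported 44.8% training subset.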
Similar Papers
Probing the Hidden Talent of ASR Foundation Models for L2 English Oral Assessment
Computation and Language
Helps computers judge how well people speak English.
The Interspeech 2025 Speech Accessibility Project Challenge
Artificial Intelligence
Helps computers understand speech from people with disabilities.
Proficiency-Aware Adaptation and Data Augmentation for Robust L2 ASR
Sound
Helps computers understand non-native English speakers better.