Session-Level Spoken Language Assessment with a Multimodal Foundation Model via Multi-Target Learning
By: Hong-Yun Lin, Jhen-Ke Lin, Chung-Chun Wang, and more
Potential Business Impact:
Tests how well people speak a new language.
Spoken Language Assessment (SLA) estimates a learner's oral proficiency from spontaneous speech. The growing population of L2 English speakers has intensified the demand for reliable SLA, a critical component of Computer-Assisted Language Learning (CALL). Existing efforts often rely on cascaded pipelines, which are prone to error propagation, or on end-to-end models that operate on short audio windows and can miss discourse-level evidence. This paper introduces a novel multimodal foundation model approach that performs session-level evaluation in a single pass. Our approach couples multi-target learning with a speech prior from a frozen Whisper ASR model for acoustic-aware calibration, enabling joint learning of holistic and trait-level SLA objectives without resorting to handcrafted features. By coherently processing an L2 speaker's entire response session, the model excels at predicting holistic oral proficiency. Experiments on the Speak & Improve benchmark demonstrate that the proposed approach outperforms the previous state-of-the-art cascaded system and exhibits robust cross-part generalization, yielding a compact, deployable grader tailored for CALL applications.
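To make the multi-target learning idea concrete, here is a minimal PyTorch sketch of a session-level grader that jointly predicts a holistic score and auxiliary trait-level scores from pooled acoustic features. It assumes frame-level features from a frozen Whisper encoder have already been extracted; the names (`MultiTargetGrader`, `multi_target_loss`, `lambda_trait`), the trait count, the feature dimension, and the 0-6 score range are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of multi-target learning for SLA (not the authors' code).
# Input: frame-level features from a frozen Whisper encoder, covering the
# learner's entire response session in one pass.
import torch
import torch.nn as nn

class MultiTargetGrader(nn.Module):
    def __init__(self, feat_dim: int = 768, hidden: int = 256, num_traits: int = 4):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(feat_dim, hidden), nn.GELU())
        self.holistic_head = nn.Linear(hidden, 1)         # overall proficiency score
        self.trait_head = nn.Linear(hidden, num_traits)   # e.g. fluency, grammar, ...

    def forward(self, feats: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # feats: (batch, frames, feat_dim) spanning the whole session
        h = self.proj(feats).mean(dim=1)  # mean-pool frames into a session vector
        return self.holistic_head(h).squeeze(-1), self.trait_head(h)

def multi_target_loss(pred_h, pred_t, gold_h, gold_t, lambda_trait: float = 0.5):
    # Joint objective: holistic regression plus weighted trait-level regression.
    mse = nn.functional.mse_loss
    return mse(pred_h, gold_h) + lambda_trait * mse(pred_t, gold_t)

# Usage: one optimization step on dummy data (two sessions, 0-6 score scale).
model = MultiTargetGrader()
feats = torch.randn(2, 1500, 768)                 # precomputed encoder features
gold_h, gold_t = torch.rand(2) * 6, torch.rand(2, 4) * 6
pred_h, pred_t = model(feats)
multi_target_loss(pred_h, pred_t, gold_h, gold_t).backward()
```

Because the encoder stays frozen, only the lightweight projection and heads are trained, which is one plausible reading of how such an approach yields a compact, deployable grader.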
Similar Papers
Beyond Modality Limitations: A Unified MLLM Approach to Automated Speaking Assessment with Effective Curriculum Learning
Computation and Language
Helps computers judge how well people speak.
The NTNU System at the S&I Challenge 2025 SLA Open Track
Computation and Language
Tests speaking skills better by combining sound and words.
A Novel Data Augmentation Approach for Automatic Speaking Assessment on Opinion Expressions
Computation and Language
Teaches computers to judge speaking skills from voice.