The NTNU System at the S&I Challenge 2025 SLA Open Track
By: Hong-Yun Lin, Tien-Hong Lo, Yu-Hsuan Fang, and more
Potential Business Impact:
Tests speaking skills better by combining sound and words.
A recent line of research on spoken language assessment (SLA) employs neural models such as BERT and wav2vec 2.0 (W2V) to evaluate speaking proficiency across linguistic and acoustic modalities. Although both models effectively capture features relevant to oral competence, each exhibits modality-specific limitations. BERT-based methods rely on ASR transcripts, which often fail to capture prosodic and phonetic cues relevant to SLA. In contrast, W2V-based methods excel at modeling acoustic features but lack semantic interpretability. To overcome these limitations, we propose a system that integrates W2V with the Phi-4 multimodal large language model (MLLM) through a score fusion strategy. The proposed system achieves a root mean square error (RMSE) of 0.375 on the official test set of the Speak & Improve Challenge 2025, securing second place in the competition. For comparison, the RMSEs of the top-ranked, third-ranked, and official baseline systems are 0.364, 0.384, and 0.444, respectively.
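The abstract describes combining the two systems through score fusion and evaluating with RMSE. As a minimal illustrative sketch, the snippet below shows one common fusion rule, a weighted average of per-response scores, together with the RMSE metric. The fusion weight `alpha` and the function names are assumptions for illustration; the paper's exact fusion strategy is not specified here.

```python
import math

def fuse_scores(w2v_scores, mllm_scores, alpha=0.5):
    """Hypothetical score fusion: weighted average of the W2V-based
    and MLLM-based proficiency scores for each response."""
    return [alpha * w + (1.0 - alpha) * m
            for w, m in zip(w2v_scores, mllm_scores)]

def rmse(pred, target):
    """Root mean square error, the challenge's evaluation metric."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target))
                     / len(pred))

# Toy example with made-up proficiency scores (not from the paper).
w2v_pred  = [3.2, 4.1, 2.8, 5.0]
mllm_pred = [3.6, 3.9, 3.0, 4.6]
gold      = [3.5, 4.0, 3.0, 4.8]

fused = fuse_scores(w2v_pred, mllm_pred, alpha=0.4)
print(round(rmse(fused, gold), 3))
```

In practice the fusion weight would be tuned on a held-out development set before scoring the official test data.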
Similar Papers
Session-Level Spoken Language Assessment with a Multimodal Foundation Model via Multi-Target Learning
Computation and Language
Tests how well people speak a new language.
NTU Speechlab LLM-Based Multilingual ASR System for Interspeech MLC-SLM Challenge 2025
Computation and Language
Makes computers understand many languages better.
Enhancing Speaker Verification with w2v-BERT 2.0 and Knowledge Distillation guided Structured Pruning
Audio and Speech Processing
Identifies people's voices from recordings.