AISTAT lab system for DCASE2025 Task6: Language-based audio retrieval
By: Hyun Jun Kim, Hyeong Yong Choi, Changwon Lim
This report presents the AISTAT team's submission to the language-based audio retrieval task in DCASE 2025 Task 6. Our system employs a dual-encoder architecture in which audio and text modalities are encoded separately and their representations are aligned via contrastive learning. Drawing inspiration from the previous year's challenge methodologies, we implemented a distillation approach and leveraged large language models (LLMs) for data augmentation, including back-translation and LLM mix. Additionally, we incorporated clustering to introduce an auxiliary classification task for further fine-tuning. Our best single system achieved a mAP@16 of 46.62, while an ensemble of four systems reached a mAP@16 of 48.83 on the Clotho development-test split.
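The contrastive alignment described above is commonly realized as a symmetric InfoNCE objective over a batch of paired audio/caption embeddings. The sketch below illustrates that standard formulation in NumPy; the temperature value and the exact loss form are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def symmetric_info_nce(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (audio, caption) embeddings.

    Matched pairs sit on the diagonal of the similarity matrix and act as
    the positive targets; all other batch entries serve as negatives.
    """
    # L2-normalize so the dot product is cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature            # (B, B) scaled similarities
    idx = np.arange(logits.shape[0])
    # Audio-to-text direction: softmax over each row.
    loss_a2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-audio direction: softmax over each column.
    loss_t2a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return (loss_a2t + loss_t2a) / 2
```

In training, minimizing this loss pulls each audio embedding toward its paired caption embedding while pushing it away from the other captions in the batch, which is what makes text-query retrieval by cosine similarity work at inference time.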