Enhancing Lyrics Transcription on Music Mixtures with Consistency Loss
By: Jiawen Huang, Felipe Sousa, Emir Demirel and more
Potential Business Impact:
Helps computers write down song lyrics automatically.
Automatic Lyrics Transcription (ALT) aims to recognize lyrics from singing voices, similar to Automatic Speech Recognition (ASR) for spoken language, but faces added complexity due to domain-specific properties of the singing voice. While foundation ASR models show robustness in various speech tasks, their performance degrades on singing voice, especially in the presence of musical accompaniment. This work focuses on this performance gap and explores Low-Rank Adaptation (LoRA) for ALT, investigating both single-domain and dual-domain fine-tuning strategies. We propose using a consistency loss to better align vocal and mixture encoder representations, improving transcription on mixtures without relying on singing voice separation. Our results show that while naïve dual-domain fine-tuning underperforms, structured training with consistency loss yields modest but consistent gains, demonstrating the potential of adapting ASR foundation models for music.
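As a rough illustration of the consistency-loss idea in the abstract above, the sketch below penalizes the distance between the encoder output for an isolated vocal track and the encoder output for the matching mixture, then adds that penalty to a transcription loss. The abstract does not state the distance measure, the weighting, or how the terms are combined, so the mean squared error, the `weight` parameter, and the function names `consistency_loss` and `total_loss` are all assumptions made for illustration, not the authors' formulation.

```python
from typing import Sequence

# One encoder output: a sequence of frames, each a feature vector.
Frames = Sequence[Sequence[float]]


def consistency_loss(vocal_repr: Frames, mixture_repr: Frames) -> float:
    """Mean squared error between two equally shaped encoder outputs.

    Assumption: MSE is used here only as a simple stand-in distance.
    """
    if len(vocal_repr) != len(mixture_repr):
        raise ValueError("representations must have the same number of frames")

    total, count = 0.0, 0
    for vocal_frame, mixture_frame in zip(vocal_repr, mixture_repr):
        if len(vocal_frame) != len(mixture_frame):
            raise ValueError("frames must have the same feature dimension")
        for v, m in zip(vocal_frame, mixture_frame):
            total += (v - m) ** 2
            count += 1

    if count == 0:
        raise ValueError("representations must not be empty")
    return total / count


def total_loss(
    transcription_loss: float,
    vocal_repr: Frames,
    mixture_repr: Frames,
    weight: float = 1.0,
) -> float:
    """Transcription loss plus a weighted consistency term.

    Assumption: a simple weighted sum; `weight` is a hypothetical hyperparameter.
    """
    return transcription_loss + weight * consistency_loss(vocal_repr, mixture_repr)
```

When the two representations are identical the consistency term is zero, so only the transcription loss remains. The further the mixture representation drifts from the vocal one, the larger the penalty.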
Similar Papers
Exploiting Music Source Separation for Automatic Lyrics Transcription with Whisper
Sound
Helps computers write song lyrics from music.
Double Entendre: Robust Audio-Based AI-Generated Lyrics Detection via Multi-View Fusion
Computation and Language
Finds fake music by listening to singing.
Melody-Lyrics Matching with Contrastive Alignment Loss
Audio and Speech Processing
Finds matching words for a song's tune.