BaldWhisper: Faster Whisper with Head Shearing and Layer Merging
By: Yaya Sy, Christophe Cerisara, Irina Illina
Potential Business Impact:
Makes voice assistants smaller and faster using far less training data.
Pruning large pre-trained transformers for low-resource languages is challenging, as it often requires massive retraining data to recover performance. For instance, Distil-Whisper prunes Whisper by 40% and retrains on 21,000 hours of speech, far beyond what is available for most languages. Can Whisper be made lighter and faster for edge devices in data-scarce settings? Focusing on Bambara with only 32 hours of speech-to-text data, we propose a new pruning recipe. Instead of vocabulary pruning, which is unsuitable due to frequent code-switching by Bambara speakers, we compress the embeddings with low-rank decomposition and feature distillation. Rather than removing layers, we merge them to limit performance loss. The final model preserves 90% of the original performance while being 48% smaller and 2.15x faster on a MacBook Air M1.
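The abstract describes two compression steps: factorizing the embedding matrix into a low-rank product, and merging layers instead of dropping them. The sketch below is a minimal illustration in PyTorch, not the authors' code: the rank, the simple weight-averaging merge, and the Whisper-like dimensions are assumptions, and the feature-distillation retraining step is omitted.

```python
# Minimal sketch of low-rank embedding compression and layer merging.
# Hypothetical settings (rank=256, averaging-based merge); not BaldWhisper's recipe.
import torch
import torch.nn as nn


def low_rank_embedding(embedding: nn.Embedding, rank: int) -> nn.Sequential:
    """Approximate a (vocab, d_model) embedding with factors (vocab, rank) x (rank, d_model)."""
    W = embedding.weight.data                          # (vocab_size, d_model)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]                         # (vocab_size, rank)
    B = Vh[:rank, :]                                   # (rank, d_model)
    factored = nn.Sequential(
        nn.Embedding.from_pretrained(A, freeze=False),
        nn.Linear(rank, W.shape[1], bias=False),       # projects back to d_model
    )
    factored[1].weight.data = B.T.contiguous()         # nn.Linear stores (out, in)
    return factored


def merge_layers(layer_a: nn.Module, layer_b: nn.Module) -> nn.Module:
    """Merge two structurally identical layers by averaging their weights in place."""
    with torch.no_grad():
        for p_a, p_b in zip(layer_a.parameters(), layer_b.parameters()):
            p_a.copy_(0.5 * (p_a + p_b))
    return layer_a


if __name__ == "__main__":
    emb = nn.Embedding(num_embeddings=51865, embedding_dim=768)  # Whisper-like sizes
    compressed = low_rank_embedding(emb, rank=256)
    tokens = torch.randint(0, 51865, (2, 10))
    print(compressed(tokens).shape)                    # torch.Size([2, 10, 768])
```

In this toy setup the embedding goes from roughly 40M parameters to about 13M; in practice the factorized embedding and the merged layers would then be fine-tuned with feature distillation, as the abstract indicates.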
Similar Papers
Structured Sparsity and Weight-adaptive Pruning for Memory and Compute efficient Whisper models
Machine Learning (CS)
Makes speech recognition work on small devices.
Adapting Whisper for Lightweight and Efficient Automatic Speech Recognition of Children for On-device Edge Applications
Audio and Speech Processing
Lets kids' voices work without sending data away.
Early Attentive Sparsification Accelerates Neural Speech Transcription
Machine Learning (CS)
Speeds up talking-to-text by making audio simpler.