Quantizing Whisper-small: How design choices affect ASR performance
By: Arthur Söhler, Julian Irigoyen, Andreas Søeborg Kirkedal
Potential Business Impact:
Shrinks AI speech models for phones.
Large speech recognition models like Whisper-small achieve high accuracy but are difficult to deploy on edge devices due to their high computational demands. To address this, we present a unified, cross-library evaluation of post-training quantization (PTQ) on Whisper-small that disentangles the impact of quantization scheme, method, granularity, and bit-width. Our study covers four libraries: PyTorch, Optimum-Quanto, HQQ, and bitsandbytes. Experiments on LibriSpeech test-clean and test-other show that dynamic int8 quantization with Quanto offers the best trade-off, reducing model size by 57% while improving on the baseline's word error rate. Static quantization performed worse, likely due to Whisper's Transformer architecture, while more aggressive formats (e.g., nf4, int3) achieved up to 71% compression at the cost of accuracy in noisy conditions. Overall, our results demonstrate that carefully chosen PTQ methods can substantially reduce model size and inference cost without retraining, enabling efficient deployment of Whisper-small on constrained hardware.
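To make the int8/Quanto result concrete, the sketch below shows one way to apply post-training int8 quantization to Whisper-small with Optimum-Quanto. It is a minimal illustration, not the authors' exact pipeline: the checkpoint name, the weight-only setting (activations left in float, so no calibration pass), and the evaluation snippet in the comments are assumptions.

```python
# Minimal sketch: weight-only int8 PTQ of Whisper-small with Optimum-Quanto.
# Checkpoint name and quantization settings are illustrative assumptions,
# not the paper's exact configuration.
from transformers import WhisperForConditionalGeneration, WhisperProcessor
from optimum.quanto import quantize, freeze, qint8

model_id = "openai/whisper-small"  # assumed checkpoint
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

# Replace linear-layer weights with int8 representations; activations stay in
# float, so no calibration data or retraining is required (pure PTQ).
quantize(model, weights=qint8)
freeze(model)  # materialize the quantized weights in place
model.eval()

# Transcribing a 16 kHz waveform `audio` (e.g., a LibriSpeech utterance):
# inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
# ids = model.generate(inputs.input_features)
# text = processor.batch_decode(ids, skip_special_tokens=True)[0]
```

Quantizing only the weights keeps the workflow calibration-free, which is what makes this kind of PTQ attractive for deployment; schemes that also quantize activations statically need representative calibration data, which is consistent with the weaker static-quantization results reported above.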
Similar Papers
Quantization for OpenAI's Whisper Models: A Comparative Analysis
Sound
Makes voice typing work faster on small devices.
Edge-ASR: Towards Low-Bit Quantization of Automatic Speech Recognition Models
Sound
Makes voice assistants work on small devices.
BitTTS: Highly Compact Text-to-Speech Using 1.58-bit Quantization and Weight Indexing
Audio and Speech Processing
Makes phones talk with tiny, smart programs.