WhisperKit: On-device Real-time ASR with Billion-Scale Transformers
By: Atila Orhon, Arda Okan, Berkin Durmus, and more
Potential Business Impact:
Lets phones transcribe your voice in real time, as fast and accurately as cloud services.
Real-time Automatic Speech Recognition (ASR) is a fundamental building block for many commercial applications of ML, including live captioning, dictation, meeting transcriptions, and medical scribes. Accuracy and latency are the most important factors when companies select a system to deploy. We present WhisperKit, an optimized on-device inference system for real-time ASR that significantly outperforms leading cloud-based systems. We benchmark against server-side systems that deploy a diverse set of models, including a frontier model (OpenAI gpt-4o-transcribe), a proprietary model (Deepgram nova-3), and an open-source model (Fireworks large-v3-turbo). Our results show that WhisperKit matches the lowest latency at 0.46s while achieving the highest accuracy, 2.2% WER. The optimizations behind the WhisperKit system are described in detail in this paper.
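The abstract reports accuracy as word error rate (WER) and latency in seconds. As a point of reference only, and not code or methodology from the paper, the sketch below shows how WER is conventionally computed from a reference transcript and an ASR hypothesis via word-level edit distance; the function name, variable names, and example strings are illustrative assumptions.

```python
# Illustrative WER computation (not from the WhisperKit paper):
# WER = (substitutions + deletions + insertions) / number of reference words,
# found with word-level Levenshtein distance via dynamic programming.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()

    # dp[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    dp = [[0] * (len(hyp_words) + 1) for _ in range(len(ref_words) + 1)]
    for i in range(len(ref_words) + 1):
        dp[i][0] = i          # i deletions
    for j in range(len(hyp_words) + 1):
        dp[0][j] = j          # j insertions

    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            if ref_words[i - 1] == hyp_words[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]           # exact match, no edit
            else:
                dp[i][j] = 1 + min(dp[i - 1][j - 1],  # substitution
                                   dp[i - 1][j],      # deletion
                                   dp[i][j - 1])      # insertion
    return dp[-1][-1] / max(len(ref_words), 1)


if __name__ == "__main__":
    ref = "real time speech recognition on device"
    hyp = "real time speech recognition on the device"
    print(f"WER: {word_error_rate(ref, hyp):.3f}")  # one insertion / six words ≈ 0.167
```

Under this convention, a reported 2.2% WER means roughly 2.2 word-level errors per 100 reference words when comparing system output against ground-truth transcripts.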
Similar Papers
AS-ASR: A Lightweight Framework for Aphasia-Specific Automatic Speech Recognition
Audio and Speech Processing
Helps people with speech problems talk to computers.
One Whisper to Grade Them All
Computation and Language
Helps computers grade spoken language tests better.
Scalable Offline ASR for Command-Style Dictation in Courtrooms
Audio and Speech Processing
Lets many people talk to computers at once.