Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine
By: Anastasia Kuznetsova, Inseon Jang, Wootaek Lim and more
Potential Business Impact:
Makes computers understand speech with less data.
Neural audio codecs, leveraging quantization algorithms, have significantly impacted various speech/audio tasks. While high-fidelity reconstruction is paramount for human perception, audio coding for machines (ACoM) prioritizes efficient compression and downstream task performance, disregarding perceptual nuances. This work introduces an efficient ACoM method that can compress and quantize any chosen intermediate feature representation of an already trained speech/audio downstream model. Our approach employs task-specific loss guidance alongside residual vector quantization (RVQ) losses, providing ultra-low bitrates (i.e., less than 200 bps) with minimal loss in downstream model performance. The resulting tokenizer is adaptable to various bitrates and model sizes for flexible deployment. Evaluated on automatic speech recognition and audio classification, our method demonstrates its efficacy and its potential for broader applicability to other tasks and architectures through appropriate regularization.
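To make the quantization step in the abstract more concrete, the following is a minimal sketch of residual vector quantization applied to a batch of feature vectors, together with the bitrate arithmetic. It is not the authors' implementation: the function names `rvq_encode` and `bitrate_bps`, the random codebooks, the feature dimension, the 10 Hz frame rate, and the two 256-entry codebooks are all illustrative assumptions. Training the codebooks with a task-specific loss alongside the RVQ losses, which is the core of the method, is not shown.

```python
import math
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization (illustrative sketch).

    x: array of shape (n, d), e.g. intermediate features of a downstream model.
    codebooks: list of arrays, each of shape (K, d); random here, learned in practice.
    Returns integer codes of shape (n, n_stages) and the reconstruction (n, d).
    """
    residual = x.copy()
    recon = np.zeros_like(x)
    codes = []
    for cb in codebooks:
        # Squared Euclidean distance from every residual to every codeword.
        d2 = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        idx = d2.argmin(axis=1)
        chosen = cb[idx]
        recon += chosen      # accumulate the quantized approximation
        residual -= chosen   # the next stage quantizes what is left over
        codes.append(idx)
    return np.stack(codes, axis=1), recon

def bitrate_bps(frame_rate_hz, codebook_sizes):
    """Bits per second when one index per codebook is sent for each frame."""
    return frame_rate_hz * sum(math.log2(k) for k in codebook_sizes)

# Hypothetical setup: 50 frames of 16-dimensional features, two stages.
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 16))
codebooks = [rng.normal(size=(256, 16)) * 0.5 ** s for s in range(2)]

codes, recon = rvq_encode(features, codebooks)
rate = bitrate_bps(10, [256, 256])  # 10 frames/s * (8 + 8) bits = 160 bps
```

Under these assumed settings each frame costs 16 bits, so a 10 Hz token rate gives 160 bps, which is one way a sub-200 bps budget could be met. Dropping a stage or shrinking a codebook lowers the rate further, which reflects the kind of bitrate adaptability the abstract describes.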
Similar Papers
MelCap: A Unified Single-Codebook Neural Codec for High-Fidelity Audio Compression
Sound
Makes music and speech sound clear with less data.
Finite Scalar Quantization Enables Redundant and Transmission-Robust Neural Audio Compression at Low Bit-rates
Sound
Makes voices sound clear even with bad internet.