ZO-ASR: Zeroth-Order Fine-Tuning of Speech Foundation Models without Back-Propagation
By: Yuezhang Peng, Yuxin Liu, Yao Li, and more
Potential Business Impact:
Makes adapting speech recognition models possible with far less computer memory.
Fine-tuning pre-trained speech foundation models for Automatic Speech Recognition (ASR) is prevalent, yet constrained by substantial GPU memory requirements. We introduce ZO-ASR, a memory-efficient Zeroth-Order (ZO) method that avoids Back-Propagation (BP) and activation memory by estimating gradients via forward passes. When combined with the SGD optimizer, ZO-ASR-SGD fine-tunes ASR models using only inference memory. Our evaluation spans supervised and unsupervised tasks. For Supervised Domain Adaptation on Whisper-Large-V3, ZO-ASR's multiple-query mechanism enhances robustness and achieves up to an 18.9% relative Word Error Rate reduction over zero-shot baselines, outperforming existing ZO methods. For unsupervised Test-Time Adaptation on Wav2Vec2-Base, ZO-ASR exhibits moderately lower performance compared to the first-order Adam optimizer. Our BP-free approach provides a viable solution for fine-tuning ASR models in computationally resource-constrained or gradient-inaccessible scenarios.
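The abstract does not spell out the update rule, but zeroth-order fine-tuning of this kind is typically implemented with an SPSA-style two-point estimator: the loss is evaluated at parameters perturbed in a few random directions, and the finite differences give a gradient estimate used in a plain SGD step. The sketch below is a minimal illustration under that assumption only; the helper names (`zo_sgd_step`, `_perturb`, `loss_fn`) and hyperparameter values are hypothetical and are not taken from the paper.

```python
import torch


def _perturb(params, seed, scale):
    """In-place add scale * z to each parameter, where z is a Gaussian
    direction regenerated from `seed` (so it never has to be stored)."""
    gen = torch.Generator(device=params[0].device).manual_seed(seed)
    for p in params:
        z = torch.randn(p.shape, generator=gen, device=p.device, dtype=p.dtype)
        p.add_(z, alpha=scale)


def zo_sgd_step(model, loss_fn, batch, lr=1e-6, eps=1e-3, num_queries=4):
    """One BP-free step (illustrative): estimate the gradient with two-point
    finite differences over `num_queries` random directions, then apply a
    plain SGD update. Only forward passes are used, so memory stays at
    inference level. `loss_fn(model, batch)` is assumed to return a scalar
    loss tensor (e.g. CTC or cross-entropy for ASR)."""
    params = [p for p in model.parameters() if p.requires_grad]
    proj_grads = []  # (seed, projected gradient along that direction)
    with torch.no_grad():
        for _ in range(num_queries):
            seed = int(torch.randint(0, 2**31 - 1, (1,)))
            _perturb(params, seed, +eps)        # theta + eps * z
            loss_plus = loss_fn(model, batch).item()
            _perturb(params, seed, -2 * eps)    # theta - eps * z
            loss_minus = loss_fn(model, batch).item()
            _perturb(params, seed, +eps)        # restore theta
            proj_grads.append((seed, (loss_plus - loss_minus) / (2 * eps)))
        # SGD update: theta <- theta - lr * mean_i(c_i * z_i)
        for seed, c in proj_grads:
            _perturb(params, seed, -lr * c / num_queries)
```

Averaging over several query directions (rather than a single one) is the kind of multiple-query mechanism the abstract credits with improved robustness, at the cost of extra forward passes per step.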