Score: 0

Articulation-Informed ASR: Integrating Articulatory Features into ASR via Auxiliary Speech Inversion and Cross-Attention Fusion

Published: October 1, 2025 | arXiv ID: 2510.08585v1

By: Ahmed Adel Attia, Jing Liu, Carol Espy Wilson

Potential Business Impact:

Helps computers understand speech better by "seeing" mouth movements.

Business Areas:
Speech Recognition Data and Analytics, Software

Prior works have investigated the use of articulatory features as complementary representations for automatic speech recognition (ASR), but their use was largely confined to shallow acoustic models. In this work, we revisit articulatory information in the era of deep learning and propose a framework that leverages articulatory representations both as an auxiliary task and as a pseudo-input to the recognition model. Specifically, we employ speech inversion as an auxiliary prediction task, and the predicted articulatory features are injected into the model as a query stream in a cross-attention module with acoustic embeddings as keys and values. Experiments on LibriSpeech demonstrate that our approach yields consistent improvements over strong transformer-based baselines, particularly under low-resource conditions. These findings suggest that articulatory features, once sidelined in ASR research, can provide meaningful benefits when reintroduced with modern architectures.

Country of Origin
🇺🇸 United States

Page Count
5 pages

Category
Electrical Engineering and Systems Science:
Audio and Speech Processing