Articulation-Informed ASR: Integrating Articulatory Features into ASR via Auxiliary Speech Inversion and Cross-Attention Fusion
By: Ahmed Adel Attia, Jing Liu, Carol Espy-Wilson
Potential Business Impact:
Helps computers understand speech better by "seeing" mouth movements inferred from the audio itself.
Prior work has investigated articulatory features as complementary representations for automatic speech recognition (ASR), but these efforts were largely confined to shallow acoustic models. In this work, we revisit articulatory information in the era of deep learning and propose a framework that leverages articulatory representations both as an auxiliary task and as a pseudo-input to the recognition model. Specifically, we employ speech inversion as an auxiliary prediction task, and the predicted articulatory features are injected into the model as the query stream of a cross-attention module, with acoustic embeddings serving as keys and values. Experiments on LibriSpeech demonstrate that our approach yields consistent improvements over strong transformer-based baselines, particularly under low-resource conditions. These findings suggest that articulatory features, once sidelined in ASR research, can provide meaningful benefits when reintroduced with modern architectures.
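To make the described fusion concrete, below is a minimal PyTorch sketch, assuming a transformer encoder that emits frame-level acoustic embeddings. The module name `ArticulatoryFusion`, the dimensions, the simple linear inversion head, and the MSE auxiliary loss are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of articulatory cross-attention fusion (assumptions noted below):
# a speech-inversion head predicts articulatory features from acoustic embeddings,
# and those predictions form the query stream of a cross-attention block whose
# keys and values are the acoustic embeddings themselves.
import torch
import torch.nn as nn


class ArticulatoryFusion(nn.Module):
    """Hypothetical fusion block; dims and the linear inversion head are assumptions."""

    def __init__(self, d_model: int = 256, n_artic: int = 12, n_heads: int = 4):
        super().__init__()
        # Auxiliary speech-inversion head: acoustic embeddings -> articulatory trajectories.
        self.inversion_head = nn.Linear(d_model, n_artic)
        # Project predicted articulatory features back to model width to form queries.
        self.artic_proj = nn.Linear(n_artic, d_model)
        # Cross-attention: articulatory queries attend over acoustic keys/values.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, acoustic: torch.Tensor):
        # acoustic: (batch, time, d_model) frame-level encoder outputs.
        artic_pred = self.inversion_head(acoustic)        # (B, T, n_artic)
        queries = self.artic_proj(artic_pred)             # (B, T, d_model)
        fused, _ = self.cross_attn(queries, acoustic, acoustic)
        # Residual connection keeps the acoustic stream intact if fusion adds little.
        return self.norm(acoustic + fused), artic_pred


# Usage: fused embeddings feed the ASR decoder; artic_pred is supervised against
# ground-truth articulatory targets (e.g., tract variables) as the auxiliary task.
encoder_out = torch.randn(2, 100, 256)
fusion = ArticulatoryFusion()
fused, artic_pred = fusion(encoder_out)
artic_targets = torch.randn_like(artic_pred)              # stand-in for real targets
aux_loss = nn.functional.mse_loss(artic_pred, artic_targets)  # assumed auxiliary loss
```

In training, the auxiliary inversion loss would be weighted against the main ASR objective (e.g., a CTC or attention loss); the abstract does not specify the weighting, so any such coefficient is a placeholder.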
Similar Papers
Acoustic to Articulatory Inversion of Speech; Data Driven Approaches, Challenges, Applications, and Future Scope
Sound
Helps people learn to speak better by showing tongue movements.
Pitch Accent Detection improves Pretrained Automatic Speech Recognition
Computation and Language
Helps computers understand spoken words better.
Automatic Speech Recognition in the Modern Era: Architectures, Training, and Evaluation
Audio and Speech Processing
Makes computers understand spoken words better.