Articulation-Informed ASR: Integrating Articulatory Features into ASR via Auxiliary Speech Inversion and Cross-Attention Fusion
By: Ahmed Adel Attia, Jing Liu, Carol Espy-Wilson
Potential Business Impact:
Helps computers understand speech better by "seeing" mouth movements inferred from the audio itself.
Prior work has investigated articulatory features as complementary representations for automatic speech recognition (ASR), but these efforts were largely confined to shallow acoustic models. In this work, we revisit articulatory information in the era of deep learning and propose a framework that leverages articulatory representations both as an auxiliary task and as a pseudo-input to the recognition model. Specifically, we employ speech inversion as an auxiliary prediction task, and the predicted articulatory features are injected into the model as the query stream of a cross-attention module, with acoustic embeddings serving as keys and values. Experiments on LibriSpeech demonstrate that our approach yields consistent improvements over strong transformer-based baselines, particularly under low-resource conditions. These findings suggest that articulatory features, once sidelined in ASR research, can provide meaningful benefits when reintroduced with modern architectures.
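To make the described fusion concrete, below is a minimal PyTorch sketch, assuming a transformer encoder that emits frame-level acoustic embeddings. The module name `ArticulatoryFusion`, the dimensions, the simple linear inversion head, and the MSE auxiliary loss are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of articulatory cross-attention fusion (assumptions noted below):
# a speech-inversion head predicts articulatory features from acoustic embeddings,
# and those predictions form the query stream of a cross-attention block whose
# keys and values are the acoustic embeddings themselves.
import torch
import torch.nn as nn


class ArticulatoryFusion(nn.Module):
    """Hypothetical fusion block; dims and the linear inversion head are assumptions."""

    def __init__(self, d_model: int = 256, n_artic: int = 12, n_heads: int = 4):
        super().__init__()
        # Auxiliary speech-inversion head: acoustic embeddings -> articulatory trajectories.
        self.inversion_head = nn.Linear(d_model, n_artic)
        # Project predicted articulatory features back to model width to form queries.
        self.artic_proj = nn.Linear(n_artic, d_model)
        # Cross-attention: articulatory queries attend over acoustic keys/values.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, acoustic: torch.Tensor):
        # acoustic: (batch, time, d_model) frame-level encoder outputs.
        artic_pred = self.inversion_head(acoustic)        # (B, T, n_artic)
        queries = self.artic_proj(artic_pred)             # (B, T, d_model)
        fused, _ = self.cross_attn(queries, acoustic, acoustic)
        # Residual connection keeps the acoustic stream intact if fusion adds little.
        return self.norm(acoustic + fused), artic_pred


# Usage: fused embeddings feed the ASR decoder; artic_pred is supervised against
# ground-truth articulatory targets (e.g., tract variables) as the auxiliary task.
encoder_out = torch.randn(2, 100, 256)
fusion = ArticulatoryFusion()
fused, artic_pred = fusion(encoder_out)
artic_targets = torch.randn_like(artic_pred)              # stand-in for real targets
aux_loss = nn.functional.mse_loss(artic_pred, artic_targets)  # assumed auxiliary loss
```

In training, the auxiliary inversion loss would be weighted against the main ASR objective (e.g., a CTC or attention loss); the abstract does not specify the weighting, so any such coefficient is a placeholder.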
Similar Papers
Acoustic to Articulatory Inversion of Speech; Data Driven Approaches, Challenges, Applications, and Future Scope
Sound
Helps people learn to speak better by showing tongue movements.
Pitch Accent Detection improves Pretrained Automatic Speech Recognition
Computation and Language
Helps computers understand spoken words better.
Automatic Speech Recognition in the Modern Era: Architectures, Training, and Evaluation
Audio and Speech Processing
Makes computers understand spoken words better.