Scalable Frameworks for Real-World Audio-Visual Speech Recognition

Published: December 16, 2025 | arXiv ID: 2512.14083v1

By: Sungnyun Kim

Potential Business Impact:

Helps computers understand speech even with noise.

Business Areas:

Speech Recognition, Data and Analytics, Software

The practical deployment of Audio-Visual Speech Recognition (AVSR) systems is fundamentally challenged by significant performance degradation in real-world environments, which are characterized by unpredictable acoustic noise and visual interference. This dissertation posits that a systematic, hierarchical approach is essential to overcome these challenges, one that achieves robust scalability at the representation, architecture, and system levels. At the representation level, we investigate methods for building a unified model that learns audio-visual features inherently robust to diverse real-world corruptions, thereby enabling generalization to new environments without specialized modules. At the architecture level, we explore how to expand model capacity efficiently while ensuring the adaptive and reliable use of multimodal inputs, developing a framework that allocates computational resources based on input characteristics. Finally, at the system level, we present methods to expand the system's functionality through modular integration with large-scale foundation models, leveraging their cognitive and generative capabilities to maximize final recognition accuracy. By providing solutions at each of these three levels, this dissertation aims to build a next-generation, robust, and scalable AVSR system that is highly reliable in real-world applications.

Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models

Audio and Speech Processing

Lets one computer understand talking from sound and sight.

10 Nov 2025 2

91%

AD-AVSR: Asymmetric Dual-stream Enhancement for Robust Audio-Visual Speech Recognition

Multimedia

Helps computers understand talking even with loud noise.

11 Aug 2025 1

91%

Purification Before Fusion: Toward Mask-Free Speech Enhancement for Robust Audio-Visual Speech Recognition

Audio and Speech Processing

Helps computers understand talking in loud places.

18 Jan 2026 0

Page Count

123 pages
