Leveraging Large Language Models in Visual Speech Recognition: Model Scaling, Context-Aware Decoding, and Iterative Polishing
By: Zehua Liu, Xiaolou Li, Li Guo, and more
Potential Business Impact:
Lets computers understand speech by watching lip movements.
Visual Speech Recognition (VSR) transcribes speech by analyzing lip movements. Recently, Large Language Models (LLMs) have been integrated into VSR systems, leading to notable performance improvements. However, the potential of LLMs has not been extensively studied, and how to utilize them effectively in VSR tasks remains underexplored. This paper systematically investigates how to better leverage LLMs for VSR and makes three key contributions: (1) Scaling Test: We study how LLM size affects VSR performance, confirming a scaling law for the VSR task. (2) Context-Aware Decoding: We add contextual text to guide LLM decoding, improving recognition accuracy. (3) Iterative Polishing: We propose iteratively refining LLM outputs, progressively reducing recognition errors. Extensive experiments demonstrate that with these designs, the potential of LLMs can be effectively harnessed, leading to significant improvements in VSR performance.
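A minimal sketch of how the two decoding ideas from the abstract might look in code, assuming a hypothetical `vsr_llm` object with a `transcribe(video, prompt)` method; the actual model interface, prompt wording, and number of polishing iterations are not specified in the abstract.

```python
def context_aware_decode(vsr_llm, video, context_text):
    """Context-Aware Decoding: prepend contextual text (e.g., preceding
    dialogue or the topic) to the prompt so the LLM decodes with that
    context in mind. `vsr_llm` and its interface are assumed here."""
    prompt = (f"Context: {context_text}\n"
              "Transcribe the speech from the speaker's lip movements:")
    return vsr_llm.transcribe(video, prompt)


def iterative_polish(vsr_llm, video, hypothesis, num_iters=3):
    """Iterative Polishing: feed the current hypothesis back to the LLM
    together with the video and ask it to correct residual errors,
    repeating for a fixed number of rounds (the count is an assumption)."""
    for _ in range(num_iters):
        prompt = (f"Draft transcript: {hypothesis}\n"
                  "Refine this transcript using the lip-movement video, "
                  "correcting any recognition errors:")
        hypothesis = vsr_llm.transcribe(video, prompt)
    return hypothesis
```

In this reading, context-aware decoding conditions a single decoding pass on extra text, while iterative polishing reuses the same model to revise its own output across several passes.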
Similar Papers
From Hype to Insight: Rethinking Large Language Model Integration in Visual Speech Recognition
Sound
Helps computers understand spoken words from lip movements.
Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models
Audio and Speech Processing
Lets a single model understand speech from both sound and sight.
Scaling Large Vision-Language Models for Enhanced Multimodal Comprehension in Biomedical Image Analysis
CV and Pattern Recognition
Helps doctors understand cancer treatment images better.