Look before Transcription: End-to-End SlideASR with Visually-Anchored Policy Optimization

Published: October 8, 2025 | arXiv ID: 2510.08618v1

By: Rui Hu, Delai Qiu, Yining Wang and more

Potential Business Impact:

Helps computers understand lectures by reading slides.

Business Areas:

Speech Recognition, Data and Analytics, Software

Automatic speech recognition (ASR) systems often struggle with domain-specific terminology, especially in specialized settings such as academic lectures. To address this, we define the SlideASR task, which leverages the rich visual information from presentation slides to improve transcription accuracy. Existing pipeline methods for this task tend to be complex and underperform. Although omni-modal large language models (OLLMs) provide a promising end-to-end framework, they frequently fail in practice by degenerating into simple optical character recognition (OCR) systems. To overcome this, we propose Visually-Anchored Policy Optimization (VAPO), a novel post-training method designed to control the model's reasoning process. Drawing on the Chain-of-Thought reasoning paradigm, VAPO enforces a structured "Look before Transcription" procedure using a <think><answer> format. Specifically, the model first performs OCR on the slide content within the think step, then generates the transcription by referencing this recognized visual information in the answer step. This reasoning process is optimized via reinforcement learning with four distinct rewards targeting format compliance, OCR accuracy, ASR quality, and visual anchoring consistency. To support further research, we construct SlideASR-Bench, a new entity-rich benchmark consisting of a synthetic dataset for training and testing, and a challenging real-world set for evaluation. Extensive experiments demonstrate that VAPO significantly improves recognition of domain-specific terms, establishing an effective end-to-end paradigm for SlideASR.
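The abstract names four reward targets but gives no formulas, so the following is only a toy sketch of how such rewards could be scored for one model output. Everything in it is an assumption made for illustration, not the authors' implementation: the closing-tag layout `<think>...</think><answer>...</answer>`, the use of 1 minus word error rate for the OCR and ASR terms, the entity-overlap definition of visual anchoring, and the names `word_error_rate`, `accuracy`, and `toy_rewards`.

```python
import re

# Assumed output layout: <think>OCR text</think><answer>transcript</answer>.
# The abstract only names a <think><answer> format; the closing tags are a guess.
_PATTERN = re.compile(
    r"^\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


def accuracy(reference: str, hypothesis: str) -> float:
    """Map WER to a score in [0, 1] (an assumed choice of reward shape)."""
    return max(0.0, 1.0 - word_error_rate(reference, hypothesis))


def toy_rewards(output: str, slide_text: str, transcript: str,
                entities: list[str]) -> dict[str, float]:
    """Score one output on the four targets named in the abstract."""
    match = _PATTERN.match(output)
    if match is None:
        # Malformed output: no think/answer spans to score.
        return {"format": 0.0, "ocr": 0.0, "asr": 0.0, "anchor": 0.0}
    think, answer = match.group(1), match.group(2)
    # Hypothetical anchoring measure: share of reference entities that
    # appear both in the OCR step and in the final transcription.
    anchored = [
        e for e in entities
        if e.lower() in think.lower() and e.lower() in answer.lower()
    ]
    anchor = len(anchored) / len(entities) if entities else 1.0
    return {
        "format": 1.0,
        "ocr": accuracy(slide_text, think),
        "asr": accuracy(transcript, answer),
        "anchor": anchor,
    }
```

In a reinforcement-learning setup, these four scores would typically be combined (for example by a weighted sum) into a single scalar reward per sampled output; the abstract does not state how VAPO weights them.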

Phoneme-Level Visual Speech Recognition via Point-Visual Fusion and Language Model Reconstruction

CV and Pattern Recognition

Lets computers "hear" words from lip movements.

25 Jul 2025

Do Slides Help? Multi-modal Context for Automatic Transcription of Conference Talks

Artificial Intelligence

Helps computers understand talks with slides better.

15 Oct 2025

Visual-Aware Speech Recognition for Noisy Scenarios

Computation and Language

Helps computers hear speech in noisy places.

9 Apr 2025

Page Count

16 pages
