V-SAT: Video Subtitle Annotation Tool

Published: October 28, 2025 | arXiv ID: 2510.24180v1

By: Arpita Kundu, Joyita Chakraborty, Anindita Desarkar, and more

Potential Business Impact:

Automatically detects and corrects subtitle errors, giving viewers accurate, readable captions.

Business Areas:
Text Analytics, Data and Analytics, Software

The surge of audiovisual content on streaming platforms and social media has heightened the demand for accurate and accessible subtitles. However, existing subtitle generation methods, primarily speech-based transcription or OCR-based extraction, suffer from several shortcomings, including poor synchronization, incorrect or harmful text, inconsistent formatting, inappropriate reading speeds, and the inability to adapt to dynamic audio-visual contexts. Current approaches often address isolated issues, leaving post-editing as a labor-intensive and time-consuming process. In this paper, we introduce V-SAT (Video Subtitle Annotation Tool), a unified framework that automatically detects and corrects a wide range of subtitle quality issues. By combining Large Language Models (LLMs), Vision-Language Models (VLMs), image processing, and Automatic Speech Recognition (ASR), V-SAT leverages contextual cues from both audio and video. Subtitle quality improved substantially: the SUBER score dropped from 9.6 to 3.54 after resolving all language-mode issues, and image-mode issues were detected with F1-scores of ~0.80. Human-in-the-loop validation ensures high-quality results, providing the first comprehensive solution for robust subtitle annotation.
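The paper does not publish its implementation on this page, but one of the defects it names, inappropriate reading speeds, is straightforward to illustrate. The sketch below is a minimal, hypothetical check (not V-SAT's actual code): it parses SRT-style cues and flags any whose characters-per-second rate exceeds a commonly cited ~17 CPS guideline. The parsing regex, helper names, and threshold are all assumptions for illustration.

```python
import re

# Assumed SRT timestamp format: HH:MM:SS,mmm --> HH:MM:SS,mmm
SRT_TIME = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})"
)

def to_seconds(h, m, s, ms):
    # Convert a parsed timestamp into seconds.
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0

def flag_fast_cues(srt_text, max_cps=17.0):
    """Yield (cue index, CPS, text) for cues exceeding the reading-speed limit.

    max_cps=17.0 is an assumed guideline value, not a figure from the paper.
    """
    blocks = re.split(r"\n\s*\n", srt_text.strip())
    for block in blocks:
        lines = block.strip().splitlines()
        if len(lines) < 3:
            continue  # need index, timing, and at least one text line
        match = SRT_TIME.search(lines[1])
        if not match:
            continue
        g = match.groups()
        start, end = to_seconds(*g[:4]), to_seconds(*g[4:])
        text = " ".join(lines[2:])
        duration = max(end - start, 1e-3)  # guard against zero-length cues
        cps = len(text) / duration
        if cps > max_cps:
            yield lines[0], cps, text

example = """1
00:00:01,000 --> 00:00:02,000
This subtitle packs far too many characters into a single second of video.

2
00:00:03,000 --> 00:00:06,000
This one reads comfortably.
"""

for idx, cps, text in flag_fast_cues(example):
    print(f"cue {idx}: {cps:.1f} CPS -> {text!r}")
```

Running this flags only the first cue (~76 CPS over one second); a full pipeline in the paper's spirit would pair checks like this with LLM/VLM passes for text and visual issues, plus human review.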

Repos / Data Links

Page Count
5 pages

Category
Computer Science:
Machine Learning (CS)