V-SAT: Video Subtitle Annotation Tool
By: Arpita Kundu, Joyita Chakraborty, Anindita Desarkar, and more
Potential Business Impact:
Automatically fixes video subtitles for a better viewing experience.
The surge of audiovisual content on streaming platforms and social media has heightened the demand for accurate and accessible subtitles. However, existing subtitle generation methods, primarily speech-based transcription or OCR-based extraction, suffer from several shortcomings, including poor synchronization, incorrect or harmful text, inconsistent formatting, inappropriate reading speeds, and the inability to adapt to dynamic audio-visual contexts. Current approaches often address isolated issues, leaving post-editing as a labor-intensive and time-consuming process. In this paper, we introduce V-SAT (Video Subtitle Annotation Tool), a unified framework that automatically detects and corrects a wide range of subtitle quality issues. By combining Large Language Models (LLMs), Vision-Language Models (VLMs), image processing, and Automatic Speech Recognition (ASR), V-SAT leverages contextual cues from both audio and video. Subtitle quality improved, with the SUBER score reduced from 9.6 to 3.54 after resolving all language-mode issues, and F1-scores of ~0.80 were achieved for image-mode issues. Human-in-the-loop validation ensures high-quality results, providing the first comprehensive solution for robust subtitle annotation.
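To give a concrete sense of the kinds of issues the abstract mentions (inappropriate reading speeds, poor synchronization), the sketch below parses an SRT file and flags cues whose reading speed or duration falls outside configurable limits. This is a minimal illustration, not the authors' implementation: the thresholds, the check_srt function, and the file name subtitles.srt are assumptions for the example, and the actual V-SAT pipeline additionally uses LLMs, VLMs, image processing, and ASR for context-aware detection and correction.

```python
import re
from datetime import timedelta

# Hypothetical thresholds; the paper does not specify the limits V-SAT uses.
MAX_CHARS_PER_SECOND = 21   # common guideline for comfortable reading speed
MIN_DURATION_SECONDS = 0.8  # very short cues are hard to read

TIME_RE = re.compile(r"(\d+):(\d+):(\d+),(\d+)")

def parse_timestamp(ts: str) -> timedelta:
    """Convert an SRT timestamp like '00:01:02,500' to a timedelta."""
    h, m, s, ms = map(int, TIME_RE.match(ts).groups())
    return timedelta(hours=h, minutes=m, seconds=s, milliseconds=ms)

def check_srt(path: str):
    """Yield (cue_index, issue) pairs for reading-speed and timing problems."""
    with open(path, encoding="utf-8") as f:
        blocks = f.read().strip().split("\n\n")
    for block in blocks:
        lines = block.strip().splitlines()
        if len(lines) < 3:
            continue  # malformed cue: index, timing line, and text are expected
        index, timing, text = lines[0], lines[1], " ".join(lines[2:])
        start_str, end_str = [t.strip() for t in timing.split("-->")]
        start, end = parse_timestamp(start_str), parse_timestamp(end_str)
        duration = (end - start).total_seconds()
        if duration <= 0:
            yield index, "end time is not after start time"
            continue
        if duration < MIN_DURATION_SECONDS:
            yield index, f"cue shown for only {duration:.2f}s"
        cps = len(text) / duration
        if cps > MAX_CHARS_PER_SECOND:
            yield index, f"reading speed {cps:.1f} chars/s exceeds limit"

if __name__ == "__main__":
    for cue, issue in check_srt("subtitles.srt"):
        print(f"cue {cue}: {issue}")
```

Rule-based checks like these only catch surface-level problems; the contextual issues the paper targets, such as harmful text or mismatches with on-screen content, require the model-based components described above.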
Similar Papers
VSI: Visual Subtitle Integration for Keyframe Selection to enhance Long Video Understanding
CV and Pattern Recognition
Finds important video moments using words and subtitles.
Leveraging Broadcast Media Subtitle Transcripts for Automatic Speech Recognition and Subtitling
Audio and Speech Processing
Helps computers understand spoken words better.
VSpeechLM: A Visual Speech Language Model for Visual Text-to-Speech Task
Multimedia
Makes videos talk with matching lip movements.