Segment, Embed, and Align: A Universal Recipe for Aligning Subtitles to Signing
By: Zifan Jiang, Youngjoon Jang, Liliane Momeni, and more
Potential Business Impact:
Aligns subtitles with sign language videos, producing parallel data for sign-to-text translation systems.
The goal of this work is to develop a universal approach for aligning subtitles (i.e., spoken language text with corresponding timestamps) to continuous sign language videos. Prior approaches typically rely on end-to-end training tied to a specific language or dataset, which limits their generality. In contrast, our method Segment, Embed, and Align (SEA) provides a single framework that works across multiple languages and domains. SEA leverages two pretrained models: the first to segment a video frame sequence into individual signs and the second to embed the video clip of each sign into a shared latent space with text. Alignment is subsequently performed with a lightweight dynamic programming procedure that runs efficiently on CPUs within a minute, even for hour-long episodes. SEA is flexible and can adapt to a wide range of scenarios, utilizing resources ranging from small lexicons to large continuous corpora. Experiments on four sign language datasets demonstrate state-of-the-art alignment performance, highlighting the potential of SEA to generate high-quality parallel data for advancing sign language processing. SEA's code and models are openly available.
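To make the "align" step more concrete, below is a minimal sketch of how segmented sign embeddings might be matched to subtitle embeddings with a monotonic dynamic program. The cosine-similarity scoring, the recurrence, and the function name `monotonic_align` are illustrative assumptions for this sketch, not the released SEA implementation; the paper's actual procedure may differ in scoring and constraints.

```python
"""Sketch: monotonic alignment of segmented signs to subtitles.

Assumptions (not from the paper): similarity is cosine similarity in the
shared latent space, every subtitle receives at least one sign, and the
number of signs N is at least the number of subtitles M.
"""
import numpy as np


def monotonic_align(sign_emb: np.ndarray, sub_emb: np.ndarray) -> np.ndarray:
    """Assign each sign (row of sign_emb, N x D) to one subtitle
    (row of sub_emb, M x D) so that assignments never move backwards in
    time and the total similarity is maximised. Returns an array of
    length N with the subtitle index for each sign."""
    # Cosine similarity between every sign clip and every subtitle text.
    a = sign_emb / np.linalg.norm(sign_emb, axis=1, keepdims=True)
    b = sub_emb / np.linalg.norm(sub_emb, axis=1, keepdims=True)
    sim = a @ b.T                              # shape (N, M)

    n, m = sim.shape
    dp = np.full((n, m), -np.inf)              # dp[i, j]: best score with sign i -> subtitle j
    back = np.zeros((n, m), dtype=int)         # 0 = stayed on subtitle j, 1 = advanced from j-1

    dp[0, 0] = sim[0, 0]
    for i in range(1, n):
        for j in range(min(i + 1, m)):         # cannot have used more subtitles than signs
            stay = dp[i - 1, j]
            advance = dp[i - 1, j - 1] if j > 0 else -np.inf
            if advance > stay:
                dp[i, j] = advance + sim[i, j]
                back[i, j] = 1
            else:
                dp[i, j] = stay + sim[i, j]

    # Backtrack from the last sign aligned to the last subtitle.
    assign = np.zeros(n, dtype=int)
    j = m - 1
    for i in range(n - 1, -1, -1):
        assign[i] = j
        j -= back[i, j]
    return assign
```

Given such an assignment, each subtitle's timestamps could be read off as the start frame of its first assigned sign and the end frame of its last, which is consistent with the abstract's claim that the alignment stage is lightweight enough to run on CPUs in well under a minute even for hour-long episodes.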
Similar Papers
Deep Understanding of Sign Language for Sign to Subtitle Alignment
CV and Pattern Recognition
Makes sign language videos match spoken words better.
SONAR-SLT: Multilingual Sign Language Translation via Language-Agnostic Sentence Embedding Supervision
Computation and Language
Translates sign language to many spoken languages.
Sign Language Translation with Sentence Embedding Supervision
Computation and Language
Teaches computers to translate sign language without labels.