USTM: Unified Spatial and Temporal Modeling for Continuous Sign Language Recognition

Published: December 15, 2025 | arXiv ID: 2512.13415v1

By: Ahmed Abul Hasanaath, Hamzah Luqman

Potential Business Impact:

Helps computers understand sign language from videos.

Business Areas:

Motion Capture, Media and Entertainment, Video

Continuous sign language recognition (CSLR) requires precise spatio-temporal modeling to accurately recognize sequences of gestures in videos. Existing frameworks often rely on CNN-based spatial backbones combined with temporal convolution or recurrent modules. These techniques fail to capture fine-grained hand and facial cues and to model long-range temporal dependencies. To address these limitations, we propose the Unified Spatio-Temporal Modeling (USTM) framework, a spatio-temporal encoder that models complex patterns by combining a Swin Transformer backbone with a lightweight temporal adapter with positional embeddings (TAPE). Our framework captures fine-grained spatial features alongside short- and long-term temporal context, enabling robust sign language recognition from RGB videos without relying on multi-stream inputs or auxiliary modalities. Extensive experiments on benchmark datasets, including PHOENIX14, PHOENIX14T, and CSL-Daily, demonstrate that USTM achieves state-of-the-art performance compared with RGB-based as well as multi-modal CSLR approaches, while maintaining competitive performance against multi-stream approaches. These results highlight the strength and efficacy of the USTM framework for CSLR. The code is available at https://github.com/gufranSabri/USTM

UST-SSM: Unified Spatio-Temporal State Space Models for Point Cloud Video Modeling

CV and Pattern Recognition

Helps computers understand 3D movement from messy data.

20 Aug 2025 2

89%

STER-VLM: Spatio-Temporal With Enhanced Reference Vision-Language Models

CV and Pattern Recognition

Helps self-driving cars understand traffic better.

19 Aug 2025 0

88%

UniSTFormer: Unified Spatio-Temporal Lightweight Transformer for Efficient Skeleton-Based Action Recognition

CV and Pattern Recognition

Makes computers understand human movements better, faster.

12 Aug 2025 1

Repos / Data Links

github.com

Page Count

10 pages
