DINOv2 Driven Gait Representation Learning for Video-Based Visible-Infrared Person Re-identification

Published: November 6, 2025 | arXiv ID: 2511.04281v1

By: Yujie Yang, Shuang Li, Jun Ye and more

Potential Business Impact:

Find people in videos using their walk.

Business Areas:

Image Recognition, Data and Analytics, Software

Video-based Visible-Infrared person re-identification (VVI-ReID) aims to retrieve the same pedestrian across visible and infrared modalities from video sequences. Existing methods tend to exploit modality-invariant visual features but largely overlook gait features, which are not only modality-invariant but also rich in temporal dynamics, thus limiting their ability to model the spatiotemporal consistency essential for cross-modal video matching. To address these challenges, we propose a DINOv2-Driven Gait Representation Learning (DinoGRL) framework that leverages the rich visual priors of DINOv2 to learn gait features complementary to appearance cues, facilitating robust sequence-level representations for cross-modal retrieval. Specifically, we introduce a Semantic-Aware Silhouette and Gait Learning (SASGL) model, which generates and enhances silhouette representations with general-purpose semantic priors from DINOv2 and jointly optimizes them with the ReID objective to achieve semantically enriched and task-adaptive gait feature learning. Furthermore, we develop a Progressive Bidirectional Multi-Granularity Enhancement (PBMGE) module, which progressively refines feature representations by enabling bidirectional interactions between gait and appearance streams across multiple spatial granularities, fully leveraging their complementarity to enhance global representations with rich local details and produce highly discriminative features. Extensive experiments on HITSZ-VCM and BUPT datasets demonstrate the superiority of our approach, significantly outperforming existing state-of-the-art methods.
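The retrieval setting described in the abstract can be illustrated with a minimal sketch. This is not the DinoGRL method: it assumes per-frame feature vectors are already available from some model, pools them over time into one sequence-level vector, and ranks visible-light gallery tracklets against an infrared query by cosine similarity. The function names (`temporal_average`, `cosine`, `rank_gallery`) and the toy feature values are hypothetical and chosen only for illustration.

```python
import math


def temporal_average(frames):
    """Average per-frame feature vectors into one sequence-level vector."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / n for d in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_gallery(query_frames, gallery):
    """Return gallery tracklet ids sorted from most to least similar to the query.

    gallery maps a tracklet id to its list of per-frame feature vectors.
    """
    query_vec = temporal_average(query_frames)
    scores = {
        tracklet_id: cosine(query_vec, temporal_average(frames))
        for tracklet_id, frames in gallery.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy data (made up): one infrared query tracklet, two visible gallery tracklets.
infrared_query = [[0.9, 0.1, 0.0], [1.0, 0.2, 0.1]]
visible_gallery = {
    "person_a": [[0.8, 0.2, 0.1], [1.0, 0.1, 0.0]],
    "person_b": [[0.0, 0.9, 0.4], [0.1, 1.0, 0.5]],
}

ranking = rank_gallery(infrared_query, visible_gallery)
print(ranking)
```

In this toy example the query features point in nearly the same direction as those of `person_a`, so that tracklet is ranked first. In a real system the feature vectors would come from the trained network, and the abstract indicates that the paper combines gait and appearance cues rather than relying on simple averaging.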

Modality-Transition Representation Learning for Visible-Infrared Person Re-Identification

CV and Pattern Recognition

Helps cameras find people in dark or bright light.

4 Nov 2025 1

89%

Video-Level Language-Driven Video-Based Visible-Infrared Person Re-Identification

CV and Pattern Recognition

Find people in different light using text.

3 Jun 2025 2

89%

Skeletons Speak Louder than Text: A Motion-Aware Pretraining Paradigm for Video-Based Person Re-Identification

CV and Pattern Recognition

Helps computers recognize people in videos by their movement.

17 Nov 2025 2

Country of Origin

🇨🇳 China

Page Count

10 pages
