DINO-CoDT: Multi-class Collaborative Detection and Tracking with Vision Foundation Models

Published: June 9, 2025 | arXiv ID: 2506.07375v1

By: Xunjie He, Christina Dao Wen Lee, Meiling Wang and more

Potential Business Impact:

Helps cars see and track all road users.

Business Areas:

Image Recognition Data and Analytics, Software

Collaborative perception, which primarily involves collaborative 3D detection and tracking, plays a crucial role in enhancing environmental understanding by expanding the perceptual range and improving robustness against sensor failures. Detection focuses on object recognition in individual frames, while tracking captures continuous instance tracklets over time. However, existing works in both areas predominantly focus on the vehicle superclass and lack effective solutions for multi-class collaborative detection and tracking. This limitation hinders their applicability in real-world scenarios, which involve diverse object classes with varying appearances and motion patterns. To overcome these limitations, we propose a multi-class collaborative detection and tracking framework tailored for diverse road users. We first present a detector with a global spatial attention fusion (GSAF) module, which enhances multi-scale feature learning for objects of varying sizes. Next, we introduce a tracklet RE-IDentification (REID) module that leverages visual semantics from a vision foundation model to reduce ID SWitch (IDSW) errors caused by erroneous mismatches involving small objects such as pedestrians. We further design a velocity-based adaptive tracklet management (VATM) module that adjusts the tracking interval dynamically based on object motion. Extensive experiments on the V2X-Real and OPV2V datasets show that our approach significantly outperforms existing state-of-the-art methods in both detection and tracking accuracy.
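The abstract does not spell out how the REID module compares visual features, so the following is only a minimal sketch of one common way appearance-based re-identification can work: a new detection is matched to a previously lost tracklet when their embedding vectors are similar enough. The function names, the cosine-similarity measure, and the `min_similarity` threshold are all assumptions made for illustration, and the embeddings are assumed to come from some vision foundation model; none of this is taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector carries no appearance information
    return dot / (norm_a * norm_b)


def reidentify(lost_tracklets, detection_embedding, min_similarity=0.7):
    """Return the ID of the most similar lost tracklet, or None.

    lost_tracklets: dict mapping track ID -> stored appearance embedding
    (hypothetical structure; embeddings assumed to be precomputed).
    min_similarity: illustrative threshold, not a value from the paper.
    """
    best_id, best_sim = None, min_similarity
    for track_id, embedding in lost_tracklets.items():
        sim = cosine_similarity(embedding, detection_embedding)
        if sim >= best_sim:
            best_id, best_sim = track_id, sim
    return best_id


# Usage: two lost tracklets with toy 2-D embeddings.
lost = {7: [1.0, 0.0], 9: [0.0, 1.0]}
matched = reidentify(lost, [0.9, 0.1])      # close to tracklet 7
unmatched = reidentify(lost, [-1.0, 0.0])   # similar to neither tracklet
```

Reusing the old ID when a match is found, instead of starting a new tracklet, is what would avoid an ID switch in this simplified setting.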

Attention-Aware Multi-View Pedestrian Tracking

CV and Pattern Recognition

Tracks people better even when they hide.

3 Apr 2025 1

88%

Intelligent driving vehicle front multi-target tracking and detection based on YOLOv5 and point cloud 3D projection

CV and Pattern Recognition

Helps cars see and track many things at once.

13 Apr 2025 0

88%

RCDINO: Enhancing Radar-Camera 3D Object Detection with DINOv2 Semantic Features

CV and Pattern Recognition

Helps cars "see" better with cameras and radar.

21 Aug 2025 2

Country of Origin

🇨🇳 China

Page Count

11 pages
