TrackTeller: Temporal Multimodal 3D Grounding for Behavior-Dependent Object References
By: Jiahong Yu, Ziqi Wang, Hailiang Zhao, and more
Potential Business Impact:
Helps self-driving cars find objects described by how they move.
Understanding natural-language references to objects in dynamic 3D driving scenes is essential for interactive autonomous systems. In practice, many referring expressions describe targets through recent motion or short-term interactions, which cannot be resolved from static appearance or geometry alone. We study temporal language-based 3D grounding, where the objective is to identify the referred object in the current frame by leveraging multi-frame observations. We propose TrackTeller, a temporal multimodal grounding framework that integrates LiDAR-image fusion, language-conditioned decoding, and temporal reasoning in a unified architecture. TrackTeller constructs a shared UniScene representation aligned with textual semantics, generates language-aware 3D proposals, and refines grounding decisions using motion history and short-term dynamics. Experiments on the NuPrompt benchmark demonstrate that TrackTeller consistently improves language-grounded tracking performance, outperforming strong baselines with a 70% relative improvement in Average Multi-Object Tracking Accuracy and a 3.15-3.4 times reduction in False Alarm Frequency.
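The abstract outlines a three-stage pipeline: fuse LiDAR and image features into a shared scene representation, decode language-conditioned 3D proposals, and refine grounding with short-term motion history. The sketch below illustrates one plausible way these stages could fit together in PyTorch; the module choices, feature dimensions, and names (e.g. TrackTellerSketch, score_head) are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class TrackTellerSketch(nn.Module):
    """Illustrative skeleton of the described pipeline (not the authors' code).

    Stages, following the abstract:
      1. Fuse per-frame LiDAR and image features into a shared scene embedding.
      2. Decode language-conditioned 3D proposals with attention to text tokens.
      3. Refine scores using a short history of per-proposal motion features.
    All dimensions and module choices are assumptions for illustration.
    """

    def __init__(self, d_model=256, num_queries=100):
        super().__init__()
        # 1. LiDAR-image fusion into a shared ("UniScene"-like) token space.
        self.lidar_proj = nn.Linear(64, d_model)    # assumed LiDAR feature dim
        self.image_proj = nn.Linear(512, d_model)   # assumed image feature dim
        self.scene_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        # 2. Language-conditioned proposal decoding.
        self.text_proj = nn.Linear(768, d_model)    # assumed text encoder dim
        self.queries = nn.Embedding(num_queries, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        # 3. Temporal refinement over a short window of proposal states.
        self.temporal = nn.GRU(d_model, d_model, batch_first=True)
        self.score_head = nn.Linear(d_model, 1)     # per-proposal grounding score

    def forward(self, lidar_feats, image_feats, text_feats, history=None):
        # lidar_feats: (B, N_pts, 64), image_feats: (B, N_pix, 512)
        # text_feats: (B, N_tok, 768), history: (B, Q, T, d_model) or None
        scene = torch.cat(
            [self.lidar_proj(lidar_feats), self.image_proj(image_feats)], dim=1
        )
        scene = self.scene_encoder(scene)

        # Attend object queries to text tokens and fused scene tokens jointly.
        text = self.text_proj(text_feats)
        q = self.queries.weight.unsqueeze(0).expand(scene.size(0), -1, -1)
        q = self.decoder(q, torch.cat([text, scene], dim=1))  # (B, Q, d_model)

        # Fold in short-term dynamics: run each query's history through a GRU.
        if history is not None:
            B, Q, T, D = history.shape
            seq = torch.cat([history, q.unsqueeze(2)], dim=2)  # append current
            _, h = self.temporal(seq.reshape(B * Q, T + 1, D))
            q = h[-1].reshape(B, Q, D)

        return self.score_head(q).squeeze(-1)  # grounding score per proposal


if __name__ == "__main__":
    model = TrackTellerSketch()
    scores = model(
        torch.randn(2, 200, 64),     # dummy LiDAR features
        torch.randn(2, 300, 512),    # dummy image features
        torch.randn(2, 12, 768),     # dummy text-token features
        torch.randn(2, 100, 4, 256), # dummy per-query motion history
    )
    print(scores.shape)  # torch.Size([2, 100])
```

In this reading, the referred object in the current frame would be the proposal with the highest grounding score after temporal refinement; how the paper actually aggregates history and selects targets is not specified in the abstract.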
Similar Papers
N3D-VLM: Native 3D Grounding Enables Accurate Spatial Reasoning in Vision-Language Models
CV and Pattern Recognition
Helps computers understand 3D objects and their places.
Visual Grounding from Event Cameras
CV and Pattern Recognition
Lets computers understand language about moving things.
Talk2Event: Grounded Understanding of Dynamic Scenes from Event Cameras
CV and Pattern Recognition
Lets cars understand language commands about their surroundings.