4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation
By: Chiao-An Yang, Ryo Hachiuma, Sifei Liu, et al.
Despite advances in Multimodal LLMs (MLLMs), their ability to reason over 3D structures and temporal dynamics remains limited, constrained by weak 4D perception and temporal understanding. Existing 3D and 4D Video Question Answering (VQA) benchmarks also emphasize static scenes and lack region-level prompting. We tackle these issues by introducing: (a) 4D-RGPT, a specialized MLLM designed to capture 4D representations from video inputs with enhanced temporal perception; (b) Perceptual 4D Distillation (P4D), a training framework that transfers 4D representations from a frozen expert model into 4D-RGPT for comprehensive 4D perception; and (c) R4D-Bench, a benchmark for depth-aware dynamic scenes with region-level prompting, built via a hybrid automated and human-verified pipeline. 4D-RGPT achieves notable improvements on both existing 4D VQA benchmarks and the proposed R4D-Bench.
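The abstract does not specify the P4D training objective, but feature distillation from a frozen expert is commonly implemented as a regression loss between the student's and the teacher's feature maps. The sketch below is a minimal, illustrative version of that idea in numpy; the function name, the MSE objective, and the toy (time, height, width, channels) feature shape are all assumptions, not details from the paper.

```python
import numpy as np

def perceptual_distillation_loss(student_feats: np.ndarray,
                                 teacher_feats: np.ndarray) -> float:
    """Mean-squared error between student features and frozen-teacher
    features. A generic feature-distillation sketch; the actual P4D
    objective may differ (e.g. cosine or contrastive terms)."""
    assert student_feats.shape == teacher_feats.shape
    return float(np.mean((student_feats - teacher_feats) ** 2))

# Toy example: a (time, height, width, channels) "4D" feature volume.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 8, 8, 16))                   # frozen 4D expert
student = teacher + 0.1 * rng.normal(size=teacher.shape)   # partially aligned student
loss = perceptual_distillation_loss(student, teacher)
print(f"distillation loss: {loss:.4f}")
```

In training, the teacher's parameters would stay frozen while gradients from this loss update only the student (here, the MLLM's visual pathway); only the student tensor participates in backpropagation.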