BEV-LLM: Leveraging Multimodal BEV Maps for Scene Captioning in Autonomous Driving
By: Felix Brandstaetter, Erik Schuetz, Katharina Winter, and more
Potential Business Impact:
Helps self-driving cars describe what they see.
Autonomous driving technology has the potential to transform transportation, but its wide adoption depends on the development of interpretable and transparent decision-making systems. Scene captioning, which generates natural language descriptions of the driving environment, plays a crucial role in enhancing transparency, safety, and human-AI interaction. We introduce BEV-LLM, a lightweight model for 3D captioning of autonomous driving scenes. BEV-LLM leverages BEVFusion to combine 3D LiDAR point clouds and multi-view images, incorporating a novel absolute positional encoding for view-specific scene descriptions. Despite using a small 1B-parameter base model, BEV-LLM achieves competitive performance on the nuCaption dataset, surpassing the state of the art by up to 5% in BLEU scores. Additionally, we release two new datasets - nuView (focused on environmental conditions and viewpoints) and GroundView (focused on object grounding) - to better assess scene captioning across diverse driving scenarios and address gaps in current benchmarks, along with initial benchmarking results demonstrating their effectiveness.
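The abstract does not detail how the view-specific absolute positional encoding is computed, but the sketch below illustrates one plausible way such an encoding could be attached to a fused BEV feature map before it is turned into tokens for the language model. The function name `view_positional_encoding`, the sinusoidal formulation, and the tensor shapes are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumed, not the authors' method): add a view-specific
# absolute positional encoding to a fused BEV feature map so that captions
# can be conditioned on a particular camera view.
import math
import torch


def view_positional_encoding(bev_feats: torch.Tensor,
                             view_yaw_rad: float,
                             num_freqs: int = 4) -> torch.Tensor:
    """Append sinusoidal encodings of each BEV cell's absolute position,
    expressed in the frame of the requested view, to the feature channels.

    bev_feats: (C, H, W) fused LiDAR/camera BEV map, ego vehicle at center.
    view_yaw_rad: heading of the view to describe (e.g. 0.0 for the front
                  camera, math.pi for the rear camera).
    """
    _, H, W = bev_feats.shape
    # Absolute cell coordinates in [-1, 1], ego vehicle at the origin.
    ys = torch.linspace(-1.0, 1.0, H)
    xs = torch.linspace(-1.0, 1.0, W)
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")

    # Rotate coordinates into the view frame so "ahead" always points along
    # the selected camera's optical axis.
    cos_t, sin_t = math.cos(view_yaw_rad), math.sin(view_yaw_rad)
    x_v = cos_t * xx + sin_t * yy
    y_v = -sin_t * xx + cos_t * yy

    # Multi-frequency sinusoidal encoding of the rotated coordinates.
    encodings = []
    for k in range(num_freqs):
        freq = (2.0 ** k) * math.pi
        for coord in (x_v, y_v):
            encodings.append(torch.sin(freq * coord))
            encodings.append(torch.cos(freq * coord))
    pos = torch.stack(encodings, dim=0)  # (4 * num_freqs, H, W)

    # Concatenate along channels; a projection layer in the captioning head
    # would map the result to the LLM's embedding size.
    return torch.cat([bev_feats, pos], dim=0)


if __name__ == "__main__":
    fused_bev = torch.randn(256, 180, 180)   # e.g. a BEVFusion output grid
    front_view = view_positional_encoding(fused_bev, view_yaw_rad=0.0)
    print(front_view.shape)                  # torch.Size([272, 180, 180])
```

Under these assumptions, switching `view_yaw_rad` is all that changes between describing the front-facing and rear-facing views; the fused BEV features themselves are shared across views.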
Similar Papers
BEVDriver: Leveraging BEV Maps in LLMs for Robust Closed-Loop Driving
Robotics
Helps self-driving cars understand and follow directions.
Vehicle-to-Infrastructure Collaborative Spatial Perception via Multimodal Large Language Models
Machine Learning (CS)
Helps cars talk to each other better, even in bad weather.
ChatBEV: A Visual Language Model that Understands BEV Maps
Computer Vision and Pattern Recognition
Helps self-driving cars understand roads better.