Towards Efficient and Effective Multi-Camera Encoding for End-to-End Driving
By: Jiawei Yang, Ziyu Chen, Yurong You and more
Potential Business Impact:
Makes self-driving cars see better and faster.
We present Flex, an efficient and effective scene encoder that addresses the computational bottleneck of processing high-volume multi-camera data in end-to-end autonomous driving. Flex employs a small set of learnable scene tokens to jointly encode information from all image tokens across different cameras and timesteps. By design, our approach is geometry-agnostic: it learns a compact scene representation directly from data, without relying on the explicit 3D inductive biases common in prior work, such as Bird's-Eye-View (BEV), occupancy, or tri-plane representations. This holistic encoding strategy aggressively compresses the visual input for the downstream Large Language Model (LLM) based policy model. Evaluated on a large-scale proprietary dataset of 20,000 driving hours, Flex achieves 2.2x greater inference throughput while improving driving performance by a large margin compared to state-of-the-art methods. Furthermore, we show that these compact scene tokens develop an emergent capability for scene decomposition without any explicit supervision. Our findings challenge the prevailing assumption that 3D priors are necessary, demonstrating that a data-driven, joint encoding strategy offers a more scalable, efficient, and effective path for future autonomous driving systems.
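To make the compression idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes the scene tokens act as queries in a single-head cross-attention over the flattened image tokens from all cameras and timesteps; the abstract does not state the exact mechanism. All sizes and names (`NUM_CAMERAS`, `TOKENS_PER_IMAGE`, `cross_attend`, and so on) are hypothetical, and random vectors stand in for learned parameters and image features.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(scene_tokens, image_tokens):
    """Each scene token (query) attends over every image token.

    Simplifying assumption: image tokens serve as both keys and values,
    with no learned projections and a single attention head.
    """
    dim = len(image_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in scene_tokens:
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in image_tokens]
        weights = softmax(scores)
        outputs.append([
            sum(w * value[d] for w, value in zip(weights, image_tokens))
            for d in range(dim)
        ])
    return outputs


# Hypothetical sizes, chosen only to keep the example small.
NUM_CAMERAS = 4
NUM_TIMESTEPS = 2
TOKENS_PER_IMAGE = 16
DIM = 8
NUM_SCENE_TOKENS = 4

rng = random.Random(0)

# Image tokens from all cameras and timesteps, flattened into one sequence.
# No camera geometry is used; the encoder sees a plain list of tokens.
image_tokens = [
    [rng.gauss(0.0, 1.0) for _ in range(DIM)]
    for _ in range(NUM_CAMERAS * NUM_TIMESTEPS * TOKENS_PER_IMAGE)
]

# Stand-in for the learnable scene tokens (these would be trained parameters).
scene_tokens = [
    [rng.gauss(0.0, 1.0) for _ in range(DIM)]
    for _ in range(NUM_SCENE_TOKENS)
]

compressed = cross_attend(scene_tokens, image_tokens)
print(len(image_tokens), "->", len(compressed))  # 128 -> 4
```

In this toy setting the downstream policy would receive 4 tokens instead of 128. The output length is fixed by the number of scene tokens, so the sequence passed to the policy does not grow when more cameras or timesteps are added.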
Similar Papers
HENet++: Hybrid Encoding and Multi-task Learning for 3D Perception and End-to-end Autonomous Driving
CV and Pattern Recognition
Helps self-driving cars see and avoid crashes.
BEV-LLM: Leveraging Multimodal BEV Maps for Scene Captioning in Autonomous Driving
CV and Pattern Recognition
Helps self-driving cars describe what they see.
Decoupling Scene Perception and Ego Status: A Multi-Context Fusion Approach for Enhanced Generalization in End-to-End Autonomous Driving
CV and Pattern Recognition
Helps self-driving cars understand scenes better.