
Towards Efficient and Effective Multi-Camera Encoding for End-to-End Driving

Published: December 11, 2025 | arXiv ID: 2512.10947v1

By: Jiawei Yang, Ziyu Chen, Yurong You and more

Potential Business Impact:

Makes self-driving cars see better and faster.

Business Areas:

Motion Capture, Media and Entertainment, Video

We present Flex, an efficient and effective scene encoder that addresses the computational bottleneck of processing high-volume multi-camera data in end-to-end autonomous driving. Flex employs a small set of learnable scene tokens to jointly encode information from all image tokens across different cameras and timesteps. By design, our approach is geometry-agnostic: it learns a compact scene representation directly from data, without relying on the explicit 3D inductive biases common in prior work, such as Bird's-Eye-View (BEV), occupancy, or tri-plane representations. This holistic encoding strategy aggressively compresses the visual input for the downstream Large Language Model (LLM) based policy model. Evaluated on a large-scale proprietary dataset of 20,000 driving hours, Flex achieves 2.2x greater inference throughput while improving driving performance by a large margin compared to state-of-the-art methods. Furthermore, we show that these compact scene tokens develop an emergent capability for scene decomposition without any explicit supervision. Our findings challenge the prevailing assumption that 3D priors are necessary, demonstrating that a data-driven, joint encoding strategy offers a more scalable, efficient, and effective path for future autonomous driving systems.
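The abstract does not spell out how the scene tokens read from the image tokens. One common way to realize this kind of joint compression is cross-attention from a small set of learned queries, and the sketch below illustrates that idea only. It is not the authors' implementation: the function name `encode_scene`, the single-head attention, the random weights, and all sizes (`T`, `C`, `P`, `D`, `K`) are assumptions chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def encode_scene(image_tokens, scene_queries, w_q, w_k, w_v):
    """Single-head cross-attention (an assumed mechanism, for illustration).

    image_tokens:  (T, C, P, D) timesteps, cameras, patches per image, feature dim
    scene_queries: (K, D) scene tokens, with K much smaller than T * C * P
    returns:       (K, D) compact scene representation
    """
    d = image_tokens.shape[-1]
    # Flatten every camera and timestep into one sequence, so each scene
    # token can attend to all image tokens jointly (no 3D geometry is used).
    flat = image_tokens.reshape(-1, d)               # (T*C*P, D)
    q = scene_queries @ w_q                          # (K, D)
    k = flat @ w_k                                   # (T*C*P, D)
    v = flat @ w_v                                   # (T*C*P, D)
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)    # (K, T*C*P)
    return attn @ v                                  # (K, D)


# Hypothetical sizes; the paper's actual configuration is not given here.
T, C, P, D, K = 2, 4, 16, 32, 8
rng = np.random.default_rng(0)
image_tokens = rng.standard_normal((T, C, P, D))
scene_queries = rng.standard_normal((K, D))  # learned parameters in a real model
w_q, w_k, w_v = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))

scene_tokens = encode_scene(image_tokens, scene_queries, w_q, w_k, w_v)
compression = (T * C * P) / K

print(scene_tokens.shape)  # (8, 32)
print(compression)         # 16.0
```

In this toy setting, 128 image tokens are reduced to 8 scene tokens before anything reaches the policy model, so the sequence the downstream LLM must process shrinks by the same factor.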

HENet++: Hybrid Encoding and Multi-task Learning for 3D Perception and End-to-end Autonomous Driving

CV and Pattern Recognition

Helps self-driving cars see and avoid crashes.

10 Nov 2025 2

87%

BEV-LLM: Leveraging Multimodal BEV Maps for Scene Captioning in Autonomous Driving

CV and Pattern Recognition

Helps self-driving cars describe what they see.

25 Jul 2025 1

87%

Decoupling Scene Perception and Ego Status: A Multi-Context Fusion Approach for Enhanced Generalization in End-to-End Autonomous Driving

CV and Pattern Recognition

Helps self-driving cars understand scenes better.

17 Nov 2025 1


Page Count

9 pages
