Reconstruction as a Bridge for Event-Based Visual Question Answering
By: Hanyue Lou, Jiayi Zhou, Yang Zhang, and more
Potential Business Impact:
Helps computers see in the dark.
Integrating event cameras with Multimodal Large Language Models (MLLMs) promises general scene understanding in challenging visual conditions, yet requires navigating a trade-off between preserving the unique advantages of event data and ensuring compatibility with frame-based models. We address this challenge by using reconstruction as a bridge, proposing a straightforward Frame-based Reconstruction and Tokenization (FRT) method and designing an efficient Adaptive Reconstruction and Tokenization (ART) method that leverages event sparsity. For robust evaluation, we introduce EvQA, the first objective, real-world benchmark for event-based MLLMs, comprising 1,000 event-Q&A pairs from 22 public datasets. Our experiments demonstrate that our methods achieve state-of-the-art performance on EvQA, highlighting the significant potential of MLLMs in event-based vision.
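The core idea above can be illustrated with a minimal sketch: accumulate sparse event data into a dense frame, then split that frame into patch tokens a frame-based vision encoder could consume. This is a generic, simplified illustration of the reconstruction-then-tokenization concept, not the paper's actual FRT or ART implementation; the accumulation scheme and patch size below are assumptions for demonstration.

```python
import numpy as np

def reconstruct_frame(events, height, width):
    """Accumulate event polarities into a 2D frame.

    events: array of (x, y, polarity) rows; polarity is +1 or -1.
    A naive accumulation baseline, not the paper's method.
    """
    frame = np.zeros((height, width), dtype=np.float32)
    for x, y, p in events:
        frame[int(y), int(x)] += p
    # Normalize to [0, 1] so the frame resembles a standard image input.
    if frame.max() > frame.min():
        frame = (frame - frame.min()) / (frame.max() - frame.min())
    return frame

def tokenize(frame, patch=4):
    """Split the frame into non-overlapping patch tokens, i.e. the
    flattening step a frame-based vision encoder would apply."""
    h, w = frame.shape
    return (frame[:h - h % patch, :w - w % patch]
            .reshape(h // patch, patch, w // patch, patch)
            .swapaxes(1, 2)
            .reshape(-1, patch * patch))

# Toy example: three events on an 8x8 sensor.
events = np.array([[2, 3, 1], [2, 3, 1], [5, 1, -1]])
frame = reconstruct_frame(events, height=8, width=8)
tokens = tokenize(frame, patch=4)
print(tokens.shape)  # (4, 16): four patch tokens of 16 pixels each
```

An adaptive scheme in the spirit of ART would additionally exploit sparsity, e.g. by skipping or merging patches with no events, rather than tokenizing the full dense frame.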
Similar Papers
Event Camera Guided Visual Media Restoration & 3D Reconstruction: A Survey
CV and Pattern Recognition
Improves blurry videos and 3D pictures.
Video-QTR: Query-Driven Temporal Reasoning Framework for Lightweight Video Understanding
CV and Pattern Recognition
Lets computers understand long videos faster.
A Survey of 3D Reconstruction with Event Cameras
CV and Pattern Recognition
Helps robots see in fast, dark, or bright places.