Reconstruction as a Bridge for Event-Based Visual Question Answering
By: Hanyue Lou, Jiayi Zhou, Yang Zhang, and more
Potential Business Impact:
Helps computers see in the dark.
Integrating event cameras with Multimodal Large Language Models (MLLMs) promises general scene understanding in challenging visual conditions, yet requires navigating a trade-off between preserving the unique advantages of event data and ensuring compatibility with frame-based models. We address this challenge by using reconstruction as a bridge, proposing a straightforward Frame-based Reconstruction and Tokenization (FRT) method and designing an efficient Adaptive Reconstruction and Tokenization (ART) method that leverages event sparsity. For robust evaluation, we introduce EvQA, the first objective, real-world benchmark for event-based MLLMs, comprising 1,000 event-Q&A pairs from 22 public datasets. Our experiments demonstrate that our methods achieve state-of-the-art performance on EvQA, highlighting the significant potential of MLLMs in event-based vision.
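The core idea above can be illustrated with a minimal sketch: accumulate sparse event data into a dense frame, then split that frame into patch tokens a frame-based vision encoder could consume. This is a generic, simplified illustration of the reconstruction-then-tokenization concept, not the paper's actual FRT or ART implementation; the accumulation scheme and patch size below are assumptions for demonstration.

```python
import numpy as np

def reconstruct_frame(events, height, width):
    """Accumulate event polarities into a 2D frame.

    events: array of (x, y, polarity) rows; polarity is +1 or -1.
    A naive accumulation baseline, not the paper's method.
    """
    frame = np.zeros((height, width), dtype=np.float32)
    for x, y, p in events:
        frame[int(y), int(x)] += p
    # Normalize to [0, 1] so the frame resembles a standard image input.
    if frame.max() > frame.min():
        frame = (frame - frame.min()) / (frame.max() - frame.min())
    return frame

def tokenize(frame, patch=4):
    """Split the frame into non-overlapping patch tokens, i.e. the
    flattening step a frame-based vision encoder would apply."""
    h, w = frame.shape
    return (frame[:h - h % patch, :w - w % patch]
            .reshape(h // patch, patch, w // patch, patch)
            .swapaxes(1, 2)
            .reshape(-1, patch * patch))

# Toy example: three events on an 8x8 sensor.
events = np.array([[2, 3, 1], [2, 3, 1], [5, 1, -1]])
frame = reconstruct_frame(events, height=8, width=8)
tokens = tokenize(frame, patch=4)
print(tokens.shape)  # (4, 16): four patch tokens of 16 pixels each
```

An adaptive scheme in the spirit of ART would additionally exploit sparsity, e.g. by skipping or merging patches with no events, rather than tokenizing the full dense frame.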
Similar Papers
Event Camera Guided Visual Media Restoration & 3D Reconstruction: A Survey
CV and Pattern Recognition
Improves blurry videos and 3D pictures.
Video-QTR: Query-Driven Temporal Reasoning Framework for Lightweight Video Understanding
CV and Pattern Recognition
Lets computers understand long videos faster.
A Survey of 3D Reconstruction with Event Cameras
CV and Pattern Recognition
Helps robots see in fast, dark, or bright places.