Reconstruction as a Bridge for Event-Based Visual Question Answering

📅 2025-12-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of integrating event cameras with multimodal large language models (MLLMs). It proposes a "reconstruction-as-bridge" paradigm in which image reconstruction serves as an intermediary that maps event streams to frame-based representations, preserving the high temporal resolution of event data while remaining compatible with existing frame-based MLLM architectures. Methodologically, the work introduces a straightforward Frame-based Reconstruction and Tokenization (FRT) method and an Adaptive Reconstruction and Tokenization (ART) strategy that leverages event sparsity for efficiency. It also establishes EvQA, the first objective, real-world benchmark for event-driven MLLMs. On EvQA, the proposed methods achieve state-of-the-art performance, substantially improving MLLM visual question answering and semantic understanding under challenging conditions such as low illumination and high-speed motion.

📝 Abstract
Integrating event cameras with Multimodal Large Language Models (MLLMs) promises general scene understanding in challenging visual conditions, yet requires navigating a trade-off between preserving the unique advantages of event data and ensuring compatibility with frame-based models. We address this challenge by using reconstruction as a bridge, proposing a straightforward Frame-based Reconstruction and Tokenization (FRT) method and designing an efficient Adaptive Reconstruction and Tokenization (ART) method that leverages event sparsity. For robust evaluation, we introduce EvQA, the first objective, real-world benchmark for event-based MLLMs, comprising 1,000 event-Q&A pairs from 22 public datasets. Our experiments demonstrate that our methods achieve state-of-the-art performance on EvQA, highlighting the significant potential of MLLMs in event-based vision.
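To make the "reconstruction as a bridge" idea concrete, the sketch below shows the simplest possible version of the frame-based pipeline: accumulate an event stream into a 2D frame, then split that frame into fixed-size patches as tokens for a frame-based vision model. All names, shapes, and the accumulation scheme are illustrative assumptions, not the paper's actual FRT implementation.

```python
# Hypothetical sketch: events -> reconstructed frame -> patch tokens.
# Event format assumed here: (x, y, timestamp, polarity).

H, W, PATCH = 8, 8, 4  # illustrative frame and patch sizes


def reconstruct_frame(events, h=H, w=W):
    """Naive reconstruction: sum signed polarities per pixel."""
    frame = [[0.0] * w for _ in range(h)]
    for x, y, _t, polarity in events:
        frame[y][x] += 1.0 if polarity > 0 else -1.0
    return frame


def tokenize(frame, patch=PATCH):
    """Flatten non-overlapping patches into token vectors."""
    tokens = []
    for py in range(0, len(frame), patch):
        for px in range(0, len(frame[0]), patch):
            tokens.append([frame[y][x]
                           for y in range(py, py + patch)
                           for x in range(px, px + patch)])
    return tokens


events = [(1, 1, 0.0, +1), (1, 1, 0.1, +1), (6, 6, 0.2, -1)]
frame = reconstruct_frame(events)
tokens = tokenize(frame)
print(len(tokens))  # 4 patches for an 8x8 frame with 4x4 patches
```

Once events are rendered into frames this way, the tokens can be fed to any frame-based MLLM unchanged, which is the compatibility half of the trade-off the abstract describes.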
Problem

Research questions and friction points this paper is trying to address.

Integrating event cameras with multimodal language models for scene understanding
Balancing event data advantages with frame-based model compatibility
Creating a benchmark for evaluating event-based visual question answering
Innovation

Methods, ideas, or system contributions that make the work stand out.

Frame-based Reconstruction and Tokenization method for compatibility
Adaptive Reconstruction and Tokenization leveraging event sparsity
EvQA benchmark for objective event-based MLLM evaluation
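One way the "leveraging event sparsity" idea can be pictured is to count events per patch and only process patches with enough activity, skipping empty regions entirely. The sketch below is purely illustrative and not the paper's actual ART algorithm; the function name and threshold are assumptions.

```python
# Hypothetical sketch of sparsity-aware tokenization: emit work only for
# patches whose event activity exceeds a threshold.
# Event format assumed here: (x, y, timestamp, polarity).

def active_patches(events, patch, min_events=1):
    """Return sorted (row, col) patch indices with >= min_events events."""
    counts = {}
    for x, y, _t, _p in events:
        key = (y // patch, x // patch)
        counts[key] = counts.get(key, 0) + 1
    return sorted(k for k, c in counts.items() if c >= min_events)


# Only 2 of the 4 patches in an 8x8 frame with 4x4 patches receive events.
events = [(1, 1, 0.0, 1), (2, 2, 0.1, 1), (6, 6, 0.2, -1)]
print(active_patches(events, patch=4))  # [(0, 0), (1, 1)]
```

Because event cameras fire only where brightness changes, most patches in a static scene are empty, so a scheme of this shape reconstructs and tokenizes far fewer regions than a dense frame-based pass.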
Hanyue Lou
Peking University
Jiayi Zhou
Peking University
Yang Zhang
Peking University
Boyu Li
Peking University
Yi Wang
Shanghai AI Laboratory
Guangnan Ye
Fudan University
Computer Vision - Machine Learning
Boxin Shi
Peking University
Computer Vision - Computational Photography