🤖 AI Summary
This work addresses the challenge of integrating event cameras with multimodal large language models (MLLMs). We propose a “reconstruction-as-bridge” paradigm, in which image reconstruction serves as an intermediary that maps event streams to frame-based representations, preserving the high temporal resolution of event data while remaining compatible with existing frame-based MLLM architectures. Methodologically, we introduce an Adaptive Reconstruction and Tokenization (ART) strategy that exploits the sparsity of event streams to improve reconstruction and tokenization efficiency. Furthermore, we establish EvQA, the first real-world benchmark for event-based MLLMs with objective, quantitative evaluation. On EvQA, our approach achieves state-of-the-art performance, outperforming prior methods and substantially improving MLLM visual question answering and semantic understanding under challenging conditions such as low illumination and high-speed motion.
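To make the “reconstruction-as-bridge” idea concrete, below is a minimal sketch of the frame-based side of the pipeline: accumulating a raw event stream into frame-like tensors that a standard MLLM vision encoder could ingest. This is a simple polarity-accumulation baseline under our own assumptions (events given as rows of `(timestamp, x, y, polarity)`; the function name `events_to_frames` is illustrative), not the paper's learned reconstruction.

```python
import numpy as np

def events_to_frames(events: np.ndarray, num_bins: int,
                     height: int, width: int) -> np.ndarray:
    """Bin events temporally and accumulate signed polarity per pixel,
    yielding frame-like tensors a frame-based MLLM encoder can ingest.

    events: array of shape (N, 4) with columns (timestamp, x, y, polarity).
    """
    t0, t1 = events[0, 0], events[-1, 0]
    span = max(t1 - t0, 1e-9)  # guard against a zero-length stream
    # Map each event's timestamp to one of num_bins temporal slices.
    bins = np.minimum(((events[:, 0] - t0) / span * num_bins).astype(int),
                      num_bins - 1)
    frames = np.zeros((num_bins, height, width), dtype=np.float32)
    signs = np.where(events[:, 3] > 0, 1.0, -1.0)
    # Scatter-add each event into its (bin, y, x) cell.
    np.add.at(frames,
              (bins, events[:, 2].astype(int), events[:, 1].astype(int)),
              signs)
    return frames

# Usage: frames = events_to_frames(ev, num_bins=5, height=480, width=640)
```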
📝 Abstract
Integrating event cameras with Multimodal Large Language Models (MLLMs) promises general scene understanding in challenging visual conditions, yet requires navigating a trade-off between preserving the unique advantages of event data and ensuring compatibility with frame-based models. We address this challenge by using reconstruction as a bridge, proposing a straightforward Frame-based Reconstruction and Tokenization (FRT) method and an efficient Adaptive Reconstruction and Tokenization (ART) method that leverages event sparsity. For robust evaluation, we introduce EvQA, the first objective, real-world benchmark for event-based MLLMs, comprising 1,000 event-Q&A pairs from 22 public datasets. Our experiments demonstrate that our methods achieve state-of-the-art performance on EvQA, highlighting the significant potential of MLLMs in event-based vision.
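The abstract attributes ART's efficiency to event sparsity. One plausible way to exploit sparsity (a sketch under our own assumptions; the paper's actual mechanism may differ, and `tile` / `min_events` are hypothetical parameters) is to reconstruct and tokenize only the spatial regions with sufficient event activity:

```python
import numpy as np

def active_tiles(events: np.ndarray, height: int, width: int,
                 tile: int = 32, min_events: int = 50) -> np.ndarray:
    """Count events per spatial tile and return (row, col) indices of
    tiles whose activity exceeds a threshold; only these tiles would be
    reconstructed and tokenized, skipping quiet sensor regions.

    Assumes the sensor resolution is divisible by the tile size and
    events are rows of (timestamp, x, y, polarity).
    """
    grid = np.zeros((height // tile, width // tile), dtype=np.int64)
    rows = events[:, 2].astype(int) // tile
    cols = events[:, 1].astype(int) // tile
    np.add.at(grid, (rows, cols), 1)  # per-tile event counts
    return np.argwhere(grid >= min_events)
```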