🤖 AI Summary
This work addresses the challenge of integrating event cameras with multimodal large language models (MLLMs). We propose a “reconstruction-as-bridge” paradigm, in which image reconstruction serves as an intermediary that maps event streams to frame-based representations, preserving the high temporal resolution of event data while remaining compatible with existing frame-based MLLM architectures. Methodologically, we introduce an Adaptive Reconstruction and Tokenization (ART) strategy that exploits the sparsity of event streams to improve reconstruction and tokenization efficiency. Furthermore, we establish EvQA, the first real-world benchmark for event-based MLLMs with objective, quantitative evaluation. On EvQA, our approach achieves state-of-the-art performance, outperforming prior methods and substantially improving MLLM visual question answering and semantic understanding under challenging conditions such as low illumination and high-speed motion.
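To make the “reconstruction-as-bridge” idea concrete, below is a minimal sketch of the frame-based side of the pipeline: accumulating a raw event stream into frame-like tensors that a standard MLLM vision encoder could ingest. This is a simple polarity-accumulation baseline under our own assumptions (events given as rows of `(timestamp, x, y, polarity)`; the function name `events_to_frames` is illustrative), not the paper's learned reconstruction.

```python
import numpy as np

def events_to_frames(events: np.ndarray, num_bins: int,
                     height: int, width: int) -> np.ndarray:
    """Bin events temporally and accumulate signed polarity per pixel,
    yielding frame-like tensors a frame-based MLLM encoder can ingest.

    events: array of shape (N, 4) with columns (timestamp, x, y, polarity).
    """
    t0, t1 = events[0, 0], events[-1, 0]
    span = max(t1 - t0, 1e-9)  # guard against a zero-length stream
    # Map each event's timestamp to one of num_bins temporal slices.
    bins = np.minimum(((events[:, 0] - t0) / span * num_bins).astype(int),
                      num_bins - 1)
    frames = np.zeros((num_bins, height, width), dtype=np.float32)
    signs = np.where(events[:, 3] > 0, 1.0, -1.0)
    # Scatter-add each event into its (bin, y, x) cell.
    np.add.at(frames,
              (bins, events[:, 2].astype(int), events[:, 1].astype(int)),
              signs)
    return frames

# Usage: frames = events_to_frames(ev, num_bins=5, height=480, width=640)
```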
📝 Abstract
Integrating event cameras with Multimodal Large Language Models (MLLMs) promises general scene understanding in challenging visual conditions, yet requires navigating a trade-off between preserving the unique advantages of event data and ensuring compatibility with frame-based models. We address this challenge by using reconstruction as a bridge, proposing a straightforward Frame-based Reconstruction and Tokenization (FRT) method and an efficient Adaptive Reconstruction and Tokenization (ART) method that leverages event sparsity. For robust evaluation, we introduce EvQA, the first objective, real-world benchmark for event-based MLLMs, comprising 1,000 event-Q&A pairs from 22 public datasets. Our experiments demonstrate that our methods achieve state-of-the-art performance on EvQA, highlighting the significant potential of MLLMs in event-based vision.
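The abstract attributes ART's efficiency to event sparsity. One plausible way to exploit sparsity (a sketch under our own assumptions; the paper's actual mechanism may differ, and `tile` / `min_events` are hypothetical parameters) is to reconstruct and tokenize only the spatial regions with sufficient event activity:

```python
import numpy as np

def active_tiles(events: np.ndarray, height: int, width: int,
                 tile: int = 32, min_events: int = 50) -> np.ndarray:
    """Count events per spatial tile and return (row, col) indices of
    tiles whose activity exceeds a threshold; only these tiles would be
    reconstructed and tokenized, skipping quiet sensor regions.

    Assumes the sensor resolution is divisible by the tile size and
    events are rows of (timestamp, x, y, polarity).
    """
    grid = np.zeros((height // tile, width // tile), dtype=np.int64)
    rows = events[:, 2].astype(int) // tile
    cols = events[:, 1].astype(int) // tile
    np.add.at(grid, (rows, cols), 1)  # per-tile event counts
    return np.argwhere(grid >= min_events)
```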