LET-US: Long Event-Text Understanding of Scenes

📅 2025-08-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal large language models (MLLMs) for event cameras struggle to effectively model long-duration event streams, suffering from weak temporal modeling and shallow cross-modal alignment. To address this, we propose an adaptive event compression mechanism coupled with text-guided feature refinement, establishing a two-stage optimization paradigm and a hierarchical clustering strategy—enabling, for the first time, deep semantic alignment between long event sequences and textual descriptions. Our method integrates high-dynamic-range event data with large language models via cross-modal queries, similarity-driven clustering, and large-scale event–text alignment training, significantly enhancing the interpretability and generalizability of event representations within the language embedding space. We achieve state-of-the-art performance across diverse reasoning, description, and retrieval tasks. Furthermore, we release the first large-scale, open-source event–text benchmark dataset to foster community advancement.
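The summary's "similarity-driven clustering" for compressing long event streams can be pictured as greedily merging the most redundant neighbouring feature segments. The sketch below is a hypothetical illustration, not the authors' code: it assumes event features arrive as a sequence of vectors and merges the most cosine-similar adjacent pair until a target length is reached.

```python
import numpy as np

def compress_event_features(feats: np.ndarray, target_len: int) -> np.ndarray:
    """Greedily merge the most similar adjacent feature pairs (cosine
    similarity) until only `target_len` representative features remain.
    Hypothetical stand-in for the paper's similarity-driven clustering."""
    feats = [f.astype(np.float64) for f in feats]
    while len(feats) > target_len:
        # Cosine similarity between each pair of adjacent segment features.
        sims = [
            np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
            for a, b in zip(feats, feats[1:])
        ]
        i = int(np.argmax(sims))           # most redundant neighbouring pair
        merged = (feats[i] + feats[i + 1]) / 2.0
        feats[i : i + 2] = [merged]        # replace the pair with their mean
    return np.stack(feats)
```

Merging only *adjacent* segments preserves temporal order, which matters for the localization and retrieval tasks the paper evaluates.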

📝 Abstract
Event cameras output event streams as sparse, asynchronous data with microsecond-level temporal resolution, enabling visual perception with low latency and a high dynamic range. While existing Multimodal Large Language Models (MLLMs) have achieved significant success in understanding and analyzing RGB video content, they either fail to interpret event streams effectively or remain constrained to very short sequences. In this paper, we introduce LET-US, a framework for long event-stream–text comprehension that employs an adaptive compression mechanism to reduce the volume of input events while preserving critical visual details. LET-US thus establishes a new frontier in cross-modal inferential understanding over extended event sequences. To bridge the substantial modality gap between event streams and textual representations, we adopt a two-stage optimization paradigm that progressively equips our model with the capacity to interpret event-based scenes. To handle the voluminous temporal information inherent in long event streams, we leverage text-guided cross-modal queries for feature reduction, augmented by hierarchical clustering and similarity computation to distill the most representative event features. Moreover, we curate and construct a large-scale event-text aligned dataset to train our model, achieving tighter alignment of event features within the LLM embedding space. We also develop a comprehensive benchmark covering a diverse set of tasks: reasoning, captioning, classification, temporal localization, and moment retrieval. Experimental results demonstrate that LET-US outperforms prior state-of-the-art MLLMs in both descriptive accuracy and semantic comprehension on long-duration event streams. All datasets, code, and models will be publicly available.
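The "text-guided cross-modal queries for feature reduction" mentioned in the abstract resemble cross-attention pooling: a small set of text-derived query vectors attends over the full event-feature sequence, reducing many event tokens to a few text-relevant ones. The single-head sketch below is an assumption about the general mechanism, not the paper's actual architecture; all function and variable names are illustrative.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def text_guided_reduce(event_feats: np.ndarray, text_queries: np.ndarray) -> np.ndarray:
    """Single-head cross-attention sketch: each text-derived query pools the
    event features most relevant to it, reducing N event tokens to Q tokens.
    Shapes: event_feats (N, d), text_queries (Q, d) -> output (Q, d)."""
    d = event_feats.shape[-1]
    scores = text_queries @ event_feats.T / np.sqrt(d)   # (Q, N) relevance
    attn = softmax(scores, axis=-1)                      # each row sums to 1
    return attn @ event_feats                            # weighted pooling
```

Because each output row is a convex combination of event features, the reduced representation stays inside the span of the original event embeddings while its size is fixed by the number of text queries.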
Problem

Research questions and friction points this paper is trying to address.

Interpreting long event streams with text using adaptive compression
Bridging modality gap between event streams and text representations
Handling voluminous temporal information in long event sequences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive compression preserves critical event details
Two-stage optimization bridges event-text modality gap
Text-guided queries distill representative event features