REEF: Relevance-Aware and Efficient LLM Adapter for Video Understanding

📅 2025-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing untrimmed video understanding methods predominantly employ similarity-based greedy visual memory compression, which often neglects the contextual importance of spatiotemporal tokens. To address this, we propose a lightweight, end-to-end trainable LLM adapter featuring a novel context-aware dynamic visual memory compression mechanism. Our approach jointly optimizes a scoring network and a differentiable Top-K operator to enable semantic-aware token selection—preserving critical contextual information while maintaining computational efficiency. The architecture comprises a dynamic visual memory bank and an LLM-Vision co-adaptation module. Evaluated on four large-scale untrimmed video datasets across video classification, video question answering, and video captioning tasks, our method achieves state-of-the-art or leading performance, reducing computational overhead by up to 34%.
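The summary names a differentiable Top-K operator that lets gradients reach the scoring network. The paper's exact operator is not given here, so below is a minimal sketch of one common relaxation (perturbed Top-K): averaging hard top-k indicator vectors over Gaussian perturbations of the scores, which is smooth in expectation. Function and parameter names are illustrative.

```python
import numpy as np

def perturbed_top_k(scores, k, sigma=0.05, n_samples=100, rng=None):
    """Relaxed Top-K: average hard top-k indicators over Gaussian
    perturbations of the scores. The expectation varies smoothly with
    the scores, so a scoring network can be trained end-to-end.
    (Illustrative sketch; not the paper's actual operator.)"""
    rng = rng or np.random.default_rng(0)
    n = scores.shape[0]
    indicator = np.zeros(n)
    for _ in range(n_samples):
        noisy = scores + sigma * rng.standard_normal(n)
        top = np.argpartition(noisy, -k)[-k:]  # hard top-k per sample
        indicator[top] += 1.0
    return indicator / n_samples  # soft membership weights in [0, 1]

scores = np.array([0.1, 2.0, -0.5, 1.5, 0.2])
weights = perturbed_top_k(scores, k=2)
# weights are near 1 for the two highest-scoring tokens
```

At inference the relaxation can be replaced by an exact hard Top-K, keeping only the selected tokens and discarding the rest.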

📝 Abstract
Integrating vision models into large language models (LLMs) has sparked significant interest in creating vision-language foundation models, especially for video understanding. Recent methods often utilize memory banks to handle untrimmed videos for video-level understanding. However, they typically compress visual memory using similarity-based greedy approaches, which can overlook the contextual importance of individual tokens. To address this, we introduce an efficient LLM adapter designed for video-level understanding of untrimmed videos that prioritizes the contextual relevance of spatio-temporal tokens. Our framework leverages scorer networks to selectively compress the visual memory bank and filter spatial tokens based on relevance, using a differentiable Top-K operator for end-to-end training. Across three key video-level understanding tasks – untrimmed video classification, video question answering, and video captioning – our method achieves competitive or superior results on four large-scale datasets while reducing computational overhead by up to 34%. The code will be available soon on GitHub.
Problem

Research questions and friction points this paper is trying to address.

Improving video-level understanding by prioritizing contextual token relevance
Reducing computational overhead in vision-language models for videos
Enhancing untrimmed video analysis with efficient memory compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

Relevance-aware LLM adapter for video understanding
Differentiable Top-K operator for token selection
Scorer networks compress visual memory efficiently
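The innovations above combine at inference time into a simple pipeline: a scorer ranks spatio-temporal tokens, and the memory bank keeps only the Top-K most relevant ones before they reach the LLM. The sketch below uses a linear scorer and hard Top-K; all names, shapes, and the scorer itself are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

# Hypothetical sketch of relevance-based memory compression.
rng = np.random.default_rng(42)
d, n_tokens, k = 16, 64, 8

tokens = rng.standard_normal((n_tokens, d))   # visual memory bank
w_scorer = rng.standard_normal(d)             # scorer parameters (toy linear scorer)

relevance = tokens @ w_scorer                 # one relevance score per token
keep = np.argpartition(relevance, -k)[-k:]    # hard Top-K at inference
compressed = tokens[keep]                     # compressed memory bank

# The LLM now attends over k of n_tokens entries (here 8/64),
# which is where the computational savings come from.
```

During training, the hard `argpartition` step would be replaced by the differentiable Top-K operator so that gradients reach `w_scorer`.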