Temporal Working Memory: Query-Guided Segment Refinement for Enhanced Multimodal Understanding

📅 2025-02-09

📈 Citations: 0

✨ Influential: 0

career value

225K/year

🤖 AI Summary

Multimodal foundation models suffer from limited internal capacity, hindering effective modeling of long-duration video/audio sequences and resulting in suboptimal temporal understanding. To address this, we propose the Temporal Working Memory (TWM) cognitive module—the first to integrate cognitive science-inspired working memory mechanisms into multimodal temporal modeling. TWM dynamically selects salient cross-modal temporal segments via query-guided attention, employs a lightweight learnable memory cache, and applies capacity-aware segment-level reweighting—enabling plug-and-play, zero-fine-tuning integration. Fully compatible with standard Transformer architectures, TWM is validated across nine state-of-the-art models, yielding consistent improvements on video captioning, video question answering, and video–text retrieval: average gains of 2.1–4.7 points in BLEU, ROUGE, and Recall@1 metrics. The implementation is publicly available.

Technology Category

Application Category

📝 Abstract

Multimodal foundation models (MFMs) have demonstrated significant success in tasks such as visual captioning, question answering, and image-text retrieval. However, these models face inherent limitations due to their finite internal capacity, which restricts their ability to process extended temporal sequences, a crucial requirement for comprehensive video and audio analysis. To overcome these challenges, we introduce a specialized cognitive module, temporal working memory (TWM), which aims to enhance the temporal modeling capabilities of MFMs. It selectively retains task-relevant information across temporal dimensions, ensuring that critical details are preserved throughout the processing of video and audio content. The TWM uses a query-guided attention approach to focus on the most informative multimodal segments within temporal sequences. By retaining only the most relevant content, TWM optimizes the use of the model's limited capacity, enhancing its temporal modeling ability. This plug-and-play module can be easily integrated into existing MFMs. With our TWM, nine state-of-the-art models exhibit significant performance improvements across tasks such as video captioning, question answering, and video-text retrieval. By enhancing temporal modeling, TWM extends the capability of MFMs to handle complex, time-sensitive data effectively. Our code is available at https://github.com/xid32/NAACL_2025_TWM.

Problem

Research questions and friction points this paper is trying to address.

Enhance temporal modeling in multimodal foundation models

Optimize limited capacity for processing temporal sequences

Improve performance in video and audio analysis tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal Working Memory module

Query-guided attention approach

Plug-and-play for existing models

🔎 Similar Papers

Chrono: A Simple Blueprint for Representing Time in MLLMs