MemoryCard: Topic-Aware Multi-Modal Clue Compression for Long-Video Question Answering

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses long-form video question answering, where critical evidence is often sparse, transient, and temporally dispersed, and where existing approaches that rely on isolated frames struggle to capture event-level semantics. To overcome this limitation, the authors propose MemoryCard, a framework that uses a self-reading process over the video and its aligned utterances to segment long videos into semantically coherent units, each corresponding to a distinct topic or event. Each unit is summarized into an event-level textual gist and paired with representative visual moments to form a unified multimodal Memory Card. This shifts the evidence unit from individual frames to topic-aware clue compression, with the cards then used for retrieval and question answering. Under comparable visual-token budgets, MemoryCard achieves up to a 21.8% relative improvement in question-answering accuracy.

📝 Abstract

Long-video question answering remains challenging for Vision-Language Models (VLMs), as answer-relevant evidence is often sparse, transient, and temporally dispersed across lengthy video contexts. Existing frame-centric approaches improve efficiency through uniform sampling, query-aware frame selection, visual-token compression, and adaptive resolution strategies. However, they still rely on isolated and fragmented frames as the fundamental evidence units, limiting VLMs' ability to effectively capture coherent event-level semantics. To address this limitation, we propose MemoryCard, a video-memory-based augmentation framework that organizes long videos into self-contained Memory Cards. Specifically, MemoryCard first performs a self-reading process over videos and aligned utterances to segment the video into semantically coherent units, each corresponding to a distinct topic or event. For each unit, it generates an event-level video gist and selects representative visual moments, which are then rendered into unified Memory Cards for retrieval and question answering. Experimental results demonstrate that MemoryCard consistently improves long-video QA performance under comparable visual-token budgets, achieving up to a 21.8% relative improvement in accuracy. All code is available at https://github.com/NEUIR/MemoryCard.
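
To make the card abstraction in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes the segmentation and gist generation have already been done upstream (the paper uses a self-reading process for this, which is not reproduced here). The names `MemoryCard`, `pick_keyframe_times`, `build_cards`, and `retrieve` are hypothetical, the evenly spaced timestamps stand in for representative-moment selection, and the word-overlap ranking stands in for whatever retriever the released code uses. The example segments are invented for illustration.

```python
from dataclasses import dataclass, field
import re


@dataclass
class MemoryCard:
    """One topic/event unit: a text gist plus a few representative timestamps."""
    start_s: float
    end_s: float
    gist: str
    keyframe_times: list = field(default_factory=list)


def pick_keyframe_times(start_s, end_s, k):
    """Placeholder for representative-moment selection: k evenly spaced
    timestamps strictly inside the segment (an assumption, not the paper's method)."""
    step = (end_s - start_s) / (k + 1)
    return [start_s + step * (i + 1) for i in range(k)]


def build_cards(segments, frames_per_card=2):
    """segments: iterable of (start_s, end_s, gist) tuples, assumed to come
    from an upstream segmentation and summarization step."""
    return [
        MemoryCard(s, e, gist, pick_keyframe_times(s, e, frames_per_card))
        for s, e, gist in segments
    ]


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(cards, question, top_k=1):
    """Toy retriever: rank cards by word overlap between question and gist."""
    q = tokenize(question)
    ranked = sorted(cards, key=lambda c: len(q & tokenize(c.gist)), reverse=True)
    return ranked[:top_k]


# Invented example segments (start seconds, end seconds, gist).
segments = [
    (0.0, 90.0, "The host introduces the kitchen and lists the ingredients."),
    (90.0, 300.0, "The host kneads the dough and explains proofing time."),
    (300.0, 420.0, "The bread is baked and the host tastes the result."),
]
cards = build_cards(segments)
best = retrieve(cards, "How long should the dough proof?")[0]
print(best.start_s, best.end_s, best.keyframe_times)
```

The point of the sketch is the change of evidence unit: the question is matched against a handful of event-level cards rather than against individual frames, and only the frames attached to the retrieved card would be passed on to the VLM, which is how the visual-token budget stays bounded.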

Problem

Research questions and friction points this paper is trying to address.

long-video question answering

Vision-Language Models

event-level semantics

video context

evidence sparsity

Innovation

Methods, ideas, or system contributions that make the work stand out.

MemoryCard

event-level semantics

multi-modal clue compression