FRAG: Frame Selection Augmented Generation for Long Video and Long Document Understanding

📅 2025-04-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the computational inefficiency and scalability challenges posed by excessively long contexts in long-video and multi-page document understanding, this paper proposes a lightweight framework that eliminates the need for long-context training. The core idea is to decouple perception from generation: each frame or page is scored independently, the highest-scoring frames are kept via simple Top-K selection, and the answer is generated conditioned only on the selected frames, entirely bypassing long-sequence modeling. The method requires no fine-tuning and can be integrated as a plug-and-play module into existing open-source large multimodal models (LMMs) such as LLaVA-OneVision and InternVL2. On MLVU and Video-MME, it boosts InternVL2-76B's performance by 5.8% and 3.7%, respectively; on MP-DocVQA, it improves accuracy by over 20% relative to recent LMMs specialized in long-document understanding, achieving new state-of-the-art results for both long-video and long-document understanding.

📝 Abstract
There has been impressive progress in Large Multimodal Models (LMMs). Recent works extend these models to long inputs, including multi-page documents and long videos. However, the model size and performance of these long context models are still limited due to the computational cost in both training and inference. In this work, we explore an orthogonal direction and process long inputs without long context LMMs. We propose Frame Selection Augmented Generation (FRAG), where the model first selects relevant frames within the input, and then only generates the final outputs based on the selected frames. The core of the selection process is done by scoring each frame independently, which does not require long context processing. The frames with the highest scores are then selected by a simple Top-K selection. We show that this frustratingly simple framework is applicable to both long videos and multi-page documents using existing LMMs without any fine-tuning. We consider two models, LLaVA-OneVision and InternVL2, in our experiments and show that FRAG consistently improves the performance and achieves state-of-the-art performances for both long video and long document understanding. For videos, FRAG substantially improves InternVL2-76B by 5.8% on MLVU and 3.7% on Video-MME. For documents, FRAG achieves over 20% improvements on MP-DocVQA compared with recent LMMs specialized in long document understanding. Code is available at: https://github.com/NVlabs/FRAG
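
The selection-then-generation pipeline described in the abstract is simple enough to sketch in a few lines. The snippet below is a minimal illustration, not the released implementation: `score_fn` and `generate_fn` are hypothetical wrappers around an existing LMM (e.g., LLaVA-OneVision or InternVL2), where `score_fn` rates a single frame's relevance to the question (for instance, the model's likelihood of answering "yes" to a relevance prompt) and `generate_fn` answers from the short list of selected frames; the `top_k` value is likewise an assumed default, not one taken from the paper.

```python
# Minimal sketch of a FRAG-style selection-then-generation pipeline.
# Assumptions (not from the paper's code): the LMM is wrapped by two
# hypothetical callables -- score_fn(frame, question) -> float, which rates
# one frame's relevance, and generate_fn(frames, question) -> str, which
# answers the question from a short list of selected frames.

from typing import Callable, List, Sequence, Tuple


def frag_answer(
    frames: Sequence,                       # decoded video frames or document pages
    question: str,
    score_fn: Callable[[object, str], float],
    generate_fn: Callable[[List[object], str], str],
    top_k: int = 8,
) -> Tuple[str, List[int]]:
    """Score each frame independently, keep the Top-K, then generate."""
    # 1) Independent per-frame scoring: no long-context processing is needed,
    #    and each call can be batched or run in parallel.
    scores = [score_fn(frame, question) for frame in frames]

    # 2) Simple Top-K selection; the selected indices are re-sorted so the
    #    generator sees frames/pages in their original order.
    k = min(top_k, len(frames))
    top_indices = sorted(
        sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)[:k]
    )

    # 3) Conditional generation on the selected frames only.
    selected = [frames[i] for i in top_indices]
    answer = generate_fn(selected, question)
    return answer, top_indices
```

Because each frame is scored in isolation, the scoring step scales linearly with input length and parallelizes trivially, while the generation step only ever sees a short, bounded context regardless of how long the original video or document is.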
Problem

Research questions and friction points this paper is trying to address.

Enhancing long-video understanding without long-context models
Improving multi-page document comprehension using frame selection
Reducing the computational cost of training and inference for LMMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Frame selection for processing long inputs
Independent frame scoring avoids long-context processing
Top-K selection improves performance without any fine-tuning