MCAF: Efficient Agent-based Video Understanding Framework through Multimodal Coarse-to-Fine Attention Focusing

📅 2025-04-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Long-video understanding faces two challenges: information redundancy and the difficulty of global modeling. This paper proposes MCAF, a training-free agent framework built around a multimodal, cooperative two-stage attention mechanism: coarse-grained perception followed by fine-grained focusing. First, cross-modal coarse filtering localizes potentially relevant temporal segments; then, confidence-driven feedback dynamically reinforces keyframes and sub-segments. To improve long-range coverage, MCAF incorporates dilated temporal expansion and sparse sampling. Crucially, MCAF requires no fine-tuning: attention allocation is optimized solely through inference-time self-reflection and iterative refinement. Evaluated on EgoSchema, Next-QA, IntentQA, and the long-duration Video-MME benchmark (videos averaging nearly an hour), MCAF improves accuracy over the previous best by 5.0% on EgoSchema, 0.2% on Next-QA, and 0.3% on IntentQA, and outperforms existing agent-based methods on Video-MME. These results demonstrate that highly effective long-video understanding is feasible without any parameter training.

📝 Abstract
Even in the era of rapid advances in large models, video understanding, particularly of long videos, remains highly challenging. Compared with textual or image-based information, videos commonly contain more information with greater redundancy, requiring large models to strategically allocate attention at a global level for accurate comprehension. To address this, we propose MCAF, an agent-based, training-free framework that performs video understanding through Multimodal Coarse-to-Fine Attention Focusing. The key innovation lies in its ability to sense and prioritize the segments of a video that are most relevant to the understanding task. First, MCAF hierarchically concentrates on highly relevant frames using multimodal information, strengthening the correlation between the acquired context and the query. Second, it employs a dilated temporal expansion mechanism to mitigate the risk of missing crucial details when extracting information from these concentrated frames. In addition, our framework incorporates a self-reflection mechanism that uses the confidence level of the model's responses as feedback. By iteratively applying these two focusing strategies, it adaptively adjusts attention to capture context closely tied to the query and thus improves response accuracy. MCAF outperforms comparable state-of-the-art methods on average. On the EgoSchema dataset, it achieves a remarkable 5% gain over the leading approach. On the Next-QA and IntentQA datasets, it outperforms the current state of the art by 0.2% and 0.3% respectively. On the Video-MME dataset, whose videos average nearly an hour in length, MCAF also outperforms other agent-based methods.
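The abstract describes an iterative loop: coarse multimodal filtering selects relevant segments, dilated temporal expansion widens each selection to avoid missing nearby details, and response confidence feeds back to broaden the focus when the answer is uncertain. The sketch below illustrates that control flow only; all function names, data structures, scoring heuristics, and thresholds are hypothetical stand-ins, as the paper does not publish this API.

```python
# Illustrative sketch of MCAF's coarse-to-fine focusing loop, based solely on
# the abstract. Every identifier and threshold here is an assumption.

def coarse_filter(segments, top_k=1):
    """Coarse stage: rank temporal segments by a crude relevance score."""
    return sorted(segments, key=lambda s: s["relevance"], reverse=True)[:top_k]

def dilated_expand(segment, all_segments, radius=1):
    """Dilated temporal expansion: include neighboring segments so details
    just outside the focused window are not missed."""
    i = next(k for k, s in enumerate(all_segments) if s["id"] == segment["id"])
    return all_segments[max(0, i - radius): i + radius + 1]

def answer_with_confidence(context):
    """Stand-in for the LLM call: returns (answer, confidence)."""
    conf = min(1.0, sum(s["relevance"] for s in context) / len(context))
    return f"answer from {len(context)} segments", conf

def mcaf_loop(segments, conf_threshold=0.8, max_iters=3):
    """Iterate coarse filter -> dilated expansion -> answer; low confidence
    triggers self-reflection feedback that widens the attention focus."""
    top_k, radius = 1, 1
    for _ in range(max_iters):
        context = []
        for seg in coarse_filter(segments, top_k=top_k):
            for s in dilated_expand(seg, segments, radius=radius):
                if s not in context:
                    context.append(s)
        answer, conf = answer_with_confidence(context)
        if conf >= conf_threshold:
            return answer, conf
        top_k += 1    # feedback: attend to more segments
        radius += 1   # feedback: widen each focused window
    return answer, conf
```

The point of the sketch is the feedback structure: nothing is trained, and attention allocation changes only through the inference-time loop, matching the paper's training-free claim.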
Problem

Research questions and friction points this paper is trying to address.

Efficiently understanding long videos with redundancy
Strategically allocating attention for accurate comprehension
Mitigating risk of missing crucial video details
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal coarse-to-fine attention focusing
Dilated temporal expansion mechanism
Self-reflection mechanism for adaptive adjustment
Authors
Shiwen Cao (Li Auto Inc., Beijing, China)
Zhaoxing Zhang (Huazhong University of Science and Technology)
Junming Jiao (Li Auto Inc., Beijing, China)
Juyi Qiao (Li Auto Inc., Beijing, China)
Guowen Song (Li Auto Inc., Beijing, China)
Rong Shen (Li Auto Inc., Beijing, China)