FocusGraph: Graph-Structured Frame Selection for Embodied Long Video Question Answering

📅 2026-03-04
🤖 AI Summary
This work addresses the performance degradation and growing inference latency that multimodal large language models (MLLMs) suffer in long-form video question answering when they must process excessive numbers of input frames. To overcome this, the authors propose FocusGraph, a framework that models video content through graph-structured scene descriptions. It integrates a lightweight, trainable Scene-Caption LLM Selector that identifies question-relevant segments and a training-free Patch-wise Sparse-Flow Retention strategy that efficiently selects keyframes based on sparse optical flow. By combining graph-based textual representations, a learnable semantic selector, and sparse motion-aware keyframe extraction in a single multimodal QA pipeline, FocusGraph achieves state-of-the-art performance on embodied long-video benchmarks such as FindingDory and HourVideo while significantly reducing inference time.

📝 Abstract
The ability to understand long videos is vital for embodied intelligent agents, because their effectiveness depends on how well they can accumulate, organize, and leverage long-horizon perceptual memories. Recently, multimodal LLMs (MLLMs) have gained popularity for long-video understanding thanks to their general ability to understand natural language and to leverage world knowledge. However, as the number of frames provided to an MLLM increases, the quality of its responses tends to degrade and inference time grows. Therefore, when using MLLMs for long-video understanding, a crucial step is selecting keyframes from the video to answer user queries. In this work, we develop FocusGraph, a framework for keyframe selection for question answering over long egocentric videos. It leverages a lightweight, trainable Scene-Caption LLM Selector that selects query-relevant clips based on their graph-based captions, and a training-free method for selecting keyframes from these clips. Unlike existing methods, the proposed Scene-Caption LLM Selector does not rely on the original sequence of low-resolution frames; instead, it operates on a compact textual representation of the scene. We then design a training-free Patch-wise Sparse-Flow Retention (PSFR) method to select keyframes from the resulting sequence of clips, which are fed into an MLLM to produce the final answer. Together, these components enable FocusGraph to achieve state-of-the-art results on challenging egocentric long-video question answering benchmarks, including FindingDory and HourVideo, while significantly reducing inference time relative to baseline approaches.
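The abstract describes PSFR as a training-free keyframe selector driven by patch-wise sparse motion. The sketch below is a hypothetical illustration only: it substitutes patch-pooled absolute frame differences for true sparse optical flow, and the function names, the top-k retention rule, and the first-frame anchor are assumptions rather than the authors' published method.

```python
import numpy as np

def patch_motion_scores(frames, patch=8):
    """Score inter-frame motion on a patch grid.

    frames: (T, H, W) grayscale array. Absolute frame differences
    pooled over non-overlapping patches serve as a cheap stand-in
    for sparse optical-flow magnitude (hypothetical proxy).
    Returns a (T-1, H//patch, W//patch) array of patch scores.
    """
    diff = np.abs(np.diff(frames.astype(np.float32), axis=0))
    t, h, w = diff.shape
    hp, wp = h // patch, w // patch
    # Reshape so each patch becomes its own axis pair, then average.
    pooled = diff[:, :hp * patch, :wp * patch].reshape(t, hp, patch, wp, patch)
    return pooled.mean(axis=(2, 4))

def retain_keyframes(frames, k=4, patch=8):
    """Keep the k frames with the largest total patch motion,
    always retaining the first frame as an anchor (assumed rule)."""
    scores = patch_motion_scores(frames, patch).sum(axis=(1, 2))
    # scores[i] measures motion between frames i and i+1;
    # attribute it to the later frame i+1.
    ranked = np.argsort(scores)[::-1][:max(k - 1, 0)] + 1
    return sorted({0, *ranked.tolist()})

# Toy clip: static frames with one abrupt change at t=5.
clip = np.zeros((10, 32, 32), dtype=np.uint8)
clip[5:] = 255
print(retain_keyframes(clip, k=2))  # → [0, 5]
```

Frame 5 carries all the motion in the toy clip, so it is retained alongside the anchor frame; in the real method the selection runs only inside the clips already chosen by the Scene-Caption LLM Selector.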
Problem

Research questions and friction points this paper is trying to address.

long video understanding
keyframe selection
embodied question answering
multimodal LLMs
egocentric video
Innovation

Methods, ideas, or system contributions that make the work stand out.

FocusGraph
keyframe selection
graph-based caption
Scene-Caption LLM Selector
Patch-wise Sparse-Flow Retention
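The selector in the list above ranks clips by how well their graph-based captions match the query. The sketch below is a hypothetical placeholder: the `SceneClip` structure, the triple-to-caption flattening, and the token-overlap scoring stand in for the paper's trainable Scene-Caption LLM Selector, which is a learned model rather than the crude overlap heuristic shown here.

```python
from dataclasses import dataclass

@dataclass
class SceneClip:
    """A clip summarized by a graph-structured caption:
    (subject, relation, object) triples, e.g. from a scene graph."""
    clip_id: int
    triples: list  # list of (str, str, str)

    def caption(self):
        # Flatten the graph into a compact textual description.
        return ". ".join(f"{s} {r} {o}" for s, r, o in self.triples)

def select_clips(clips, query, top_k=2):
    """Rank clips by query relevance. Token overlap below is a
    crude hypothetical proxy for the learned LLM scoring."""
    q = set(query.lower().split())
    scored = [(len(q & set(c.caption().lower().split())), c.clip_id)
              for c in clips]
    scored.sort(key=lambda t: (-t[0], t[1]))  # best score first
    return [cid for _, cid in scored[:top_k]]

clips = [
    SceneClip(0, [("robot", "enters", "kitchen")]),
    SceneClip(1, [("keys", "on", "table"), ("cup", "near", "sink")]),
    SceneClip(2, [("robot", "opens", "fridge")]),
]
print(select_clips(clips, "where are the keys", top_k=1))  # → [1]
```

Because the selector sees only these compact textual representations, it never has to process the raw low-resolution frame sequence, which is the efficiency argument the abstract makes.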
👥 Authors
Tatiana Zemskova (IAI MIPT LLC)
Solomon Andryushenko (Yandex)
Ilya Obrubov (FusionBrain Lab)
Viktoriia Khoruzhaia (MIRAI)
Ekaterina Eroshenko (MIRAI)
Ekaterina Derevyanka (FusionBrain Lab)
Dmitry Yudin (AXXX, MIRAI)