From Content to Knowledge: Lightning Fast Long-Video Understanding with Neural Knowledge Representations

📅 2026-06-10
🤖 AI Summary
This work addresses the high computational cost, substantial inference latency, and limited support for multi-turn interaction in long-form video understanding. The authors encode each video into a compact neural knowledge representation (NKR), a small set of network weights optimized through a one-time Agentic Knowledge Distillation (AKD) process, and then mount it on a frozen vision-language model (VLM) at inference. Because queries are answered without reloading or re-encoding the video, inference cost is decoupled from video duration. On the LVBench benchmark, the method achieves performance comparable to state-of-the-art approaches while reducing end-to-end latency by over two orders of magnitude, which makes multi-turn interaction with long videos more practical.
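The decoupling of inference cost from video duration can be illustrated with a toy cost model. This is only a sketch under stated assumptions: the function names, the linear re-encoding cost, and every number below are hypothetical and are not taken from the paper. It compares re-encoding the video on every query against paying a one-time encoding cost and then answering each query at a fixed cost.

```python
import math


def reencode_total_s(num_queries: int, video_minutes: float,
                     encode_s_per_minute: float, answer_s: float) -> float:
    """Total latency when every query re-encodes the whole video.

    Assumption: encoding cost grows linearly with video length.
    """
    return num_queries * (video_minutes * encode_s_per_minute + answer_s)


def nkr_total_s(num_queries: int, distill_s: float, answer_s: float) -> float:
    """Total latency with one up-front encoding pass, then fixed-cost queries.

    Note that video length does not appear in the per-query term.
    """
    return distill_s + num_queries * answer_s


def break_even_queries(distill_s: float, video_minutes: float,
                       encode_s_per_minute: float) -> int:
    """Smallest query count at which the one-time pass is strictly cheaper."""
    per_query_saving = video_minutes * encode_s_per_minute
    return math.floor(distill_s / per_query_saving) + 1


# Hypothetical numbers: a 60-minute video, 2 s of encoding per video minute,
# 1 s to answer a query, and a one-time pass that takes 3600 s.
n = break_even_queries(distill_s=3600.0, video_minutes=60.0,
                       encode_s_per_minute=2.0)
print(n, reencode_total_s(n, 60.0, 2.0, 1.0), nkr_total_s(n, 3600.0, 1.0))
```

Under this model the one-time pass pays off only after enough queries on the same video, which is why the abstract frames the gain as amortized efficiency for multi-turn use.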
📝 Abstract
We propose a new paradigm for long-video understanding that treats a long video as a Neural Knowledge Representation (NKR). An NKR represents video content neither as a stream of tokens nor as a pre-organized database, but as a small, per-video set of network weights attached to the backbone of a Vision-Language Model (VLM). The NKR weights are optimized to encapsulate the video's semantic content via a novel Agentic Knowledge Distillation (AKD) process, in which an agent automatically synthesizes dense descriptions and question-answer pairs to distill the video's knowledge into the NKR. AKD is a comprehensive, one-time encoding phase; the resulting NKR turns the video into a portable, reusable asset. At inference, the lightweight NKR is mounted onto a frozen VLM, enabling direct, query-based understanding without reloading or re-encoding the original video. This approach decouples inference cost from video length, offering high amortized efficiency for multi-turn video understanding. Experiments on the LVBench benchmark show that our method achieves performance comparable to state-of-the-art approaches while reducing end-to-end latency by over two orders of magnitude, opening new possibilities for interactive long-video understanding.
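To make the idea of mounting per-video weights on a frozen backbone concrete, here is a minimal sketch. It assumes a low-rank additive adapter, which is one common way to attach a small set of weights to a frozen layer; the abstract does not say which mechanism the authors use. `VideoNKR`, `FrozenLayer`, `mount`, and `unmount` are hypothetical names, and the tiny matrices are illustrative only.

```python
from dataclasses import dataclass
from typing import List, Optional

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


@dataclass(frozen=True)
class VideoNKR:
    """Hypothetical per-video weights, stored as a low-rank pair (assumption)."""
    video_id: str
    down: Matrix  # shape: rank x d_in
    up: Matrix    # shape: d_out x rank


class FrozenLayer:
    """Stand-in for one backbone layer whose base weights are never updated."""

    def __init__(self, weight: Matrix) -> None:
        self._weight = weight
        self._nkr: Optional[VideoNKR] = None

    def mount(self, nkr: VideoNKR) -> None:
        """Attach one video's weights; the base weights stay untouched."""
        self._nkr = nkr

    def unmount(self) -> None:
        self._nkr = None

    def forward(self, x: Vector) -> Vector:
        y = matvec(self._weight, x)
        if self._nkr is not None:
            # Add the video-specific low-rank correction to the frozen output.
            delta = matvec(self._nkr.up, matvec(self._nkr.down, x))
            y = [a + b for a, b in zip(y, delta)]
        return y


layer = FrozenLayer([[1.0, 0.0], [0.0, 1.0]])
nkr_a = VideoNKR("video_a", down=[[1.0, 1.0]], up=[[0.5], [0.0]])
nkr_b = VideoNKR("video_b", down=[[1.0, -1.0]], up=[[0.0], [1.0]])

query = [2.0, 3.0]
base_out = layer.forward(query)   # no video mounted
layer.mount(nkr_a)
out_a = layer.forward(query)      # same backbone, video_a's weights
layer.mount(nkr_b)
out_b = layer.forward(query)      # swap videos without touching the backbone
layer.unmount()
```

In this sketch, switching videos only swaps a small object, so the same frozen layer can serve many videos. In the paper's setting, the per-video weights would be the product of the AKD optimization rather than hand-written values.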
Problem

Research questions and friction points this paper is trying to address.

long-video understanding
neural knowledge representation
inference efficiency
video semantic content
query-based understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Neural Knowledge Representation
Agentic Knowledge Distillation
Long-Video Understanding
Vision-Language Model
Efficient Inference