Controllable Hybrid Captioner for Improved Long-form Video Understanding

📅 2025-07-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Long-video text summarization often overemphasizes dynamic action descriptions while neglecting static scene information, limiting question-answering coverage—particularly for queries requiring contextual scene understanding. Method: We propose a switchable hybrid captioning framework that introduces scene-change trigger tokens to jointly model dynamic actions and static scenes within a unified architecture. Leveraging LaViLa and LLaVA foundations, it integrates temporal segmentation modeling with large language model–driven multimodal alignment, avoiding cascaded multi-model deployment. Contribution/Results: Experiments demonstrate significant improvements in complex long-video question answering. Our approach extends answerable query scope to fine-grained semantic levels dependent on scene context, while preserving end-to-end simplicity. It establishes a more comprehensive and controllable multimodal textual representation paradigm for video understanding—enabling explicit intervention in memory construction through token-level scene dynamics control.

Technology Category

Application Category

📝 Abstract
Video data, especially long-form video, is extremely dense and high-dimensional. Text-based summaries of video content offer a way to represent query-relevant content in a much more compact manner than raw video. In addition, textual representations are easily ingested by state-of-the-art large language models (LLMs), which enable reasoning over video content to answer complex natural language queries. To solve this issue, we rely on the progressive construction of a text-based memory by a video captioner operating on shorter chunks of the video, where spatio-temporal modeling is computationally feasible. We explore ways to improve the quality of the activity log comprised solely of short video captions. Because the video captions tend to be focused on human actions, and questions may pertain to other information in the scene, we seek to enrich the memory with static scene descriptions using Vision Language Models (VLMs). Our video understanding system relies on the LaViLa video captioner in combination with a LLM to answer questions about videos. We first explored different ways of partitioning the video into meaningful segments such that the textual descriptions more accurately reflect the structure of the video content. Furthermore, we incorporated static scene descriptions into the captioning pipeline using LLaVA VLM, resulting in a more detailed and complete caption log and expanding the space of questions that are answerable from the textual memory. Finally, we have successfully fine-tuned the LaViLa video captioner to produce both action and scene captions, significantly improving the efficiency of the captioning pipeline compared to using separate captioning models for the two tasks. Our model, controllable hybrid captioner, can alternate between different types of captions according to special input tokens that signals scene changes detected in the video.
Problem

Research questions and friction points this paper is trying to address.

Improving long-form video understanding via text summaries
Enriching video captions with static scene descriptions
Enhancing caption quality by partitioning video segments effectively
Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressive text-based memory from video chunks
Enriched captions with static scene descriptions
Hybrid captioner for action and scene captions
🔎 Similar Papers
2024-02-20International Conference on Machine LearningCitations: 30