SD-VSum: A Method and Dataset for Script-Driven Video Summarization

📅 2025-05-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper introduces script-driven video summarization: given a user-written natural language script outlining the visual content of the desired summary, the task is to select the most relevant parts of a full-length video and produce a personalized summary. The authors extend the VideoXum dataset for generic video summarization by writing natural language descriptions of its human-annotated summaries, yielding video–summary–script triplets suitable for training. They then propose SD-VSum, a network architecture that uses a cross-modal attention mechanism to align and fuse information from the visual and text modalities. Experiments show that SD-VSum outperforms state-of-the-art query-driven and generic (unimodal and multimodal) summarization methods, and that it can produce summaries adapted to each user's needs about their content.

📝 Abstract
In this work, we introduce the task of script-driven video summarization, which aims to produce a summary of a full-length video by selecting the parts that are most relevant to a user-provided script outlining the visual content of the desired summary. Next, we extend a recently introduced large-scale dataset for generic video summarization (VideoXum) by producing natural language descriptions of the different human-annotated summaries available per video. This makes the dataset compatible with the introduced task, since the resulting triplets of "video, summary, and summary description" can be used to train a method that produces different summaries for a given video, driven by the provided script about the content of each summary. Finally, we develop a new network architecture for script-driven video summarization (SD-VSum) that relies on a cross-modal attention mechanism for aligning and fusing information from the visual and text modalities. Our experimental evaluations demonstrate the advanced performance of SD-VSum against state-of-the-art approaches for query-driven and generic (unimodal and multimodal) summarization from the literature, and document its capacity to produce video summaries adapted to each user's needs about their content.
Problem

Research questions and friction points this paper is trying to address.

Develop script-driven video summarization for user-specific content
Extend dataset with natural language descriptions for training
Propose cross-modal attention network for visual-text alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Script-driven video summarization with user-provided content outlines
Cross-modal attention for visual and text information fusion
Large-scale dataset adaptation with video-summary-description triplets
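The core mechanism above (script text attending over video frame features to score their relevance) can be illustrated with a minimal numpy sketch. This is not the paper's implementation: the function names, embedding dimensions, and the simple top-k frame selection are hypothetical, and a real system would use learned projections and trained encoders.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(text_emb, frame_emb):
    """Hypothetical sketch: script-token embeddings (queries) attend over
    video-frame embeddings (keys/values) via scaled dot-product attention.

    text_emb:  (n_tokens, d) script/text embeddings
    frame_emb: (n_frames, d) per-frame visual embeddings
    Returns per-frame relevance scores and text-conditioned fused features.
    """
    d = text_emb.shape[-1]
    scores = text_emb @ frame_emb.T / np.sqrt(d)   # (n_tokens, n_frames)
    attn = softmax(scores, axis=-1)                # each token's weights over frames
    fused = attn @ frame_emb                       # (n_tokens, d) fused representation
    frame_relevance = attn.mean(axis=0)            # (n_frames,) average attention per frame
    return frame_relevance, fused

# Toy usage with random features (a trained model would produce these).
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 64))      # 4 script tokens, 64-dim
frames = rng.normal(size=(120, 64))  # 120 video frames, 64-dim
rel, fused = cross_modal_attention(text, frames)
summary_idx = np.sort(np.argsort(rel)[-10:])  # indices of the 10 most script-relevant frames
```

The averaged attention weights act as a script-conditioned importance score per frame, so different scripts for the same video yield different selected segments, which is the behavior the task requires.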