🤖 AI Summary
This work addresses the over-reliance on textual prompts in automatic multi-shot video assembly by proposing SKALD, a primarily vision-driven shot assembly framework. Methodologically, it introduces a Learned Clip Assembly (LCA) score trained with a two-task self-supervised paradigm that combines contrastive shot-coherence learning and feature regression, employs Vision Transformers to model temporal and semantic relationships among shots, and uses beam search guided by the LCA score to tame the combinatorial space of shot orderings; an optional text-enhanced variant (SKALD-text) preserves flexibility when auxiliary text is available. Evaluated on VSPD and MSV3C, the method achieves up to a 48.6% IoU improvement and 43% faster inference, and a user study shows that 45% of participants prefer its outputs. This is the first end-to-end multi-shot narrative assembly framework that operates without strong textual supervision, establishing a new paradigm for vision-centric video editing.
📝 Abstract
We present SKALD, a multi-shot video assembly method that constructs coherent video sequences from candidate shots with minimal reliance on text. Central to our approach is the Learned Clip Assembly (LCA) score, a learning-based metric that measures temporal and semantic relationships between shots to quantify narrative coherence. We tackle the exponential complexity of combining multiple shots with an efficient beam-search algorithm guided by the LCA score. To train our model effectively with limited human annotations, we propose two tasks for the LCA encoder: Shot Coherence Learning, which uses contrastive learning to distinguish coherent from incoherent sequences, and Feature Regression, which converts these learned representations into a real-valued coherence score. We develop two variants: a base SKALD model that relies solely on visual coherence, and SKALD-text, which integrates auxiliary text information when available. Experiments on the VSPD and our curated MSV3C datasets show that SKALD achieves an improvement of up to 48.6% in IoU and a 43% speedup over state-of-the-art methods. A user study further validates our approach, with 45% of participants favoring SKALD-assembled videos, compared to 22% preferring text-based assembly methods.
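To make the assembly step concrete, the sketch below shows a generic beam search over shot orderings guided by a learned coherence score. This is an illustrative toy, not the authors' implementation: `lca_score` here is a hypothetical stand-in for the paper's LCA model, replaced by a trivial heuristic that rewards consecutively numbered shots so the example is self-contained.

```python
# Illustrative sketch (assumed, not the authors' code): beam search that
# assembles a sequence of shots by repeatedly extending the top-k partial
# sequences ranked by a coherence score.

def lca_score(sequence):
    # Hypothetical stand-in for the learned LCA score: count adjacent
    # pairs of consecutively numbered shots as a toy coherence proxy.
    return sum(1.0 for a, b in zip(sequence, sequence[1:]) if b == a + 1)

def beam_search_assembly(shots, seq_len, beam_width=3):
    """Return the highest-scoring ordering of `seq_len` distinct shots."""
    beams = [((s,), 0.0) for s in shots]          # (partial sequence, score)
    for _ in range(seq_len - 1):
        candidates = []
        for seq, _ in beams:
            for s in shots:
                if s in seq:                      # use each shot at most once
                    continue
                new_seq = seq + (s,)
                candidates.append((new_seq, lca_score(new_seq)))
        # prune: keep only the top-k partial sequences by score
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return max(beams, key=lambda b: b[1])[0]

best = beam_search_assembly([3, 1, 2, 0], seq_len=4)
# With the toy score, the best ordering is the ascending one: (0, 1, 2, 3)
```

The beam width trades off quality against cost: width 1 degenerates to greedy assembly, while exhaustively scoring every ordering is factorial in the number of shots, which is the exponential blow-up the paper's beam search avoids.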