🤖 AI Summary
This work addresses the over-reliance on textual prompts in automatic multi-shot video assembly by proposing SKALD, a primarily vision-driven shot assembly framework. Methodologically, it introduces a Learned Clip Assembly (LCA) score trained with a two-task self-supervised paradigm that combines contrastive shot-coherence learning and feature regression, employs Vision Transformers to model temporal and semantic relationships among shots, and uses beam search guided by the LCA score to tame the combinatorial space of shot orderings; an optional text-enhanced variant (SKALD-text) preserves flexibility when auxiliary text is available. Evaluated on VSPD and MSV3C, the method achieves up to a 48.6% IoU improvement and 43% faster inference, and a user study shows that 45% of participants prefer its outputs. This is the first end-to-end multi-shot narrative assembly framework that operates without strong textual supervision, establishing a new paradigm for vision-centric video editing.
📝 Abstract
We present SKALD, a multi-shot video assembly method that constructs coherent video sequences from candidate shots with minimal reliance on text. Central to our approach is the Learned Clip Assembly (LCA) score, a learning-based metric that measures temporal and semantic relationships between shots to quantify narrative coherence. We tackle the exponential complexity of combining multiple shots with an efficient beam-search algorithm guided by the LCA score. To train our model effectively with limited human annotations, we propose two tasks for the LCA encoder: Shot Coherence Learning, which uses contrastive learning to distinguish coherent from incoherent sequences, and Feature Regression, which converts these learned representations into a real-valued coherence score. We develop two variants: a base SKALD model that relies solely on visual coherence, and SKALD-text, which integrates auxiliary text information when available. Experiments on the VSPD and our curated MSV3C datasets show that SKALD achieves an improvement of up to 48.6% in IoU and a 43% speedup over state-of-the-art methods. A user study further validates our approach, with 45% of participants favoring SKALD-assembled videos, compared to 22% preferring text-based assembly methods.
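To make the assembly step concrete, the sketch below shows a generic beam search over shot orderings guided by a learned coherence score. This is an illustrative toy, not the authors' implementation: `lca_score` here is a hypothetical stand-in for the paper's LCA model, replaced by a trivial heuristic that rewards consecutively numbered shots so the example is self-contained.

```python
# Illustrative sketch (assumed, not the authors' code): beam search that
# assembles a sequence of shots by repeatedly extending the top-k partial
# sequences ranked by a coherence score.

def lca_score(sequence):
    # Hypothetical stand-in for the learned LCA score: count adjacent
    # pairs of consecutively numbered shots as a toy coherence proxy.
    return sum(1.0 for a, b in zip(sequence, sequence[1:]) if b == a + 1)

def beam_search_assembly(shots, seq_len, beam_width=3):
    """Return the highest-scoring ordering of `seq_len` distinct shots."""
    beams = [((s,), 0.0) for s in shots]          # (partial sequence, score)
    for _ in range(seq_len - 1):
        candidates = []
        for seq, _ in beams:
            for s in shots:
                if s in seq:                      # use each shot at most once
                    continue
                new_seq = seq + (s,)
                candidates.append((new_seq, lca_score(new_seq)))
        # prune: keep only the top-k partial sequences by score
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return max(beams, key=lambda b: b[1])[0]

best = beam_search_assembly([3, 1, 2, 0], seq_len=4)
# With the toy score, the best ordering is the ascending one: (0, 1, 2, 3)
```

The beam width trades off quality against cost: width 1 degenerates to greedy assembly, while exhaustively scoring every ordering is factorial in the number of shots, which is the exponential blow-up the paper's beam search avoids.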