Scene-VLM: Multimodal Video Scene Segmentation via Vision-Language Models

📅 2025-12-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address key bottlenecks in long-video scene segmentation, namely visual-centric bias, isolated per-shot modeling, insufficient narrative understanding, and poor interpretability, this paper proposes the first end-to-end vision-language model (VLM)-driven sequential scene segmentation framework. Methodologically, it jointly processes frame images, speech transcripts, and optional metadata; introduces a context-focus sliding-window mechanism; and employs causal sequence modeling with token-level confidence estimation to enable fine-grained boundary localization and natural-language explanations of segmentation decisions. Its core innovation is the first application of VLMs to multimodal scene-boundary detection, moving away from the conventional encoder-based paradigm of classifying each shot in isolation toward semantically coherent temporal modeling. Evaluated on MovieNet, the method achieves state-of-the-art performance, improving on the prior best method by +6.0 AP and +13.7 F1. It further supports adjustable precision-recall trade-offs and controllable explanation generation.
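The page does not reproduce the authors' code, so the following is a minimal sketch of one plausible reading of the context-focus window: each shot takes a turn as the focus of a boundary decision while a fixed number of neighboring shots supply temporal context. The `Window` type, the `build_windows` helper, and the context sizes are hypothetical illustrations, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Iterator, List

@dataclass
class Window:
    context: List[int]  # indices of neighboring shots supplying temporal context
    focus: int          # index of the shot whose boundary decision is being made

def build_windows(num_shots: int, ctx_before: int = 8, ctx_after: int = 4) -> Iterator[Window]:
    """Yield one context-focus window per shot, clipped at the video ends."""
    for i in range(num_shots):
        lo = max(0, i - ctx_before)
        hi = min(num_shots, i + ctx_after + 1)
        yield Window(context=[j for j in range(lo, hi) if j != i], focus=i)

windows = list(build_windows(20))
print(windows[0])  # Window(context=[1, 2, 3, 4], focus=0)
```

Processing the windows in order preserves the causal, left-to-right structure of the prediction sequence: earlier boundary decisions are available as context when later shots are classified.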

📝 Abstract
Segmenting long-form videos into semantically coherent scenes is a fundamental task in large-scale video understanding. Existing encoder-based methods are limited by visual-centric biases, classify each shot in isolation without leveraging sequential dependencies, and lack both narrative understanding and explainability. In this paper, we present Scene-VLM, the first fine-tuned vision-language model (VLM) framework for video scene segmentation. Scene-VLM jointly processes visual and textual cues, including frames, transcriptions, and optional metadata, to enable multimodal reasoning across consecutive shots. The model generates predictions sequentially with causal dependencies among shots and introduces a context-focus window mechanism to ensure sufficient temporal context for each shot-level decision. In addition, we propose a scheme to extract confidence scores from the token-level logits of the VLM, enabling controllable precision-recall trade-offs that were previously limited to encoder-based methods. Furthermore, we demonstrate that our model can be aligned to generate coherent natural-language rationales for its boundary decisions through minimal targeted supervision. Our approach achieves state-of-the-art performance on standard scene segmentation benchmarks. On MovieNet, for example, Scene-VLM yields significant improvements of +6.0 AP and +13.7 F1 over the previous leading method.
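The abstract describes deriving confidence scores from the VLM's token-level logits without giving the exact scheme. A common recipe consistent with that description, sketched here as an assumption rather than the authors' method, renormalizes the logits of the two answer tokens at the decision position; `yes_id`, `no_id`, and the function name are hypothetical.

```python
import torch
import torch.nn.functional as F

def boundary_confidence(step_logits: torch.Tensor, yes_id: int, no_id: int) -> float:
    """Turn vocabulary logits at the decision position into P(boundary).

    step_logits: (vocab_size,) logits at the position where the VLM emits its
    boundary / no-boundary answer token; yes_id / no_id are those tokens' ids
    (hypothetical, prompt-format dependent).
    """
    pair = torch.stack([step_logits[yes_id], step_logits[no_id]])
    return F.softmax(pair, dim=0)[0].item()  # probability mass on "boundary"
```

Because the score is a continuous probability rather than a hard token choice, it can be thresholded after generation, which is what makes the precision-recall trade-off controllable.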
Problem

Research questions and friction points this paper is trying to address.

Segmenting long videos into coherent scenes
Overcoming visual-centric bias and isolated per-shot classification
Enhancing narrative understanding and explainability in segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuned vision-language model for multimodal video segmentation
Sequential prediction with causal dependencies and context-focus window
Confidence scores extracted from token-level logits for controllable precision-recall trade-offs (see the threshold-sweep sketch below)
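As a usage sketch of that last point, a decision threshold can be swept over the per-shot confidence scores: raising it trades recall for precision and vice versa. The helper below is illustrative, not from the paper.

```python
from typing import Iterable, Sequence

def sweep_thresholds(confidences: Sequence[float],
                     labels: Sequence[bool],
                     thresholds: Iterable[float] = (0.3, 0.5, 0.7, 0.9)) -> None:
    """Print precision/recall at each threshold to expose the trade-off."""
    for t in thresholds:
        preds = [c >= t for c in confidences]
        tp = sum(p and y for p, y in zip(preds, labels))
        fp = sum(p and not y for p, y in zip(preds, labels))
        fn = sum(not p and y for p, y in zip(preds, labels))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        print(f"threshold={t:.1f}  precision={precision:.3f}  recall={recall:.3f}")
```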
👥 Authors
Nimrod Berman, Ben Gurion University (Deep Learning)
Adam Botach, Amazon Prime Video
Emanuel Ben-Baruch, Amazon Prime Video
Shunit Haviv Hakimi, Amazon Prime Video
Asaf Gendler, Amazon Prime Video
Ilan Naiman, PhD Student, Computer Science at Ben Gurion University (Deep Learning)
Erez Yosef, Tel Aviv University (Computer Vision, Deep Learning, Artificial Intelligence)
Igor Kviatkovsky, Amazon Prime Video