Enhancing Subsequent Video Retrieval via Vision-Language Models (VLMs)

📅 2025-03-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of temporal sensitivity and context-adaptive retrieval in dynamic video corpora, this paper proposes a Vision-Language Model (VLM)-driven, graph-enhanced retrieval framework. The method maps multi-granularity semantic embeddings generated by a VLM onto directed temporal graph nodes, explicitly modeling both temporal and semantic dependencies among video segments via the graph structure. Crucially, it supports context-aware, multi-turn query refinement through iterative re-ranking. Unlike conventional single-shot retrieval paradigms, the framework enables progressive intent refinement across interaction rounds. Evaluated on multiple benchmark datasets, it achieves a 12.6% improvement in mean Average Precision (mAP) and reduces response latency by 37%, significantly enhancing long-video segment localization accuracy and interactive robustness.
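
The paper's code is not reproduced on this page; the sketch below is only a minimal illustration of the graph construction the summary describes, mapping per-segment embeddings (stand-ins for VLM outputs) onto nodes of a directed temporal graph with edges for temporal adjacency and semantic similarity. The use of networkx/numpy, the similarity threshold, and all function names are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (assumption: networkx + numpy as stand-ins; the paper's actual
# graph construction, thresholds, and VLM encoder are not specified here).
import numpy as np
import networkx as nx

def build_temporal_graph(segment_embeddings, sim_threshold=0.75):
    """Map per-segment VLM embeddings onto a directed temporal graph.

    segment_embeddings: (num_segments, dim) array, ordered by time.
    Edges: (i -> i+1) for temporal adjacency, plus (i -> j) for pairs whose
    cosine similarity exceeds sim_threshold (a semantic dependency).
    """
    emb = np.asarray(segment_embeddings, dtype=np.float32)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize rows
    n = emb.shape[0]

    g = nx.DiGraph()
    for i in range(n):
        g.add_node(i, embedding=emb[i])

    # Temporal edges follow the playback order of the segments.
    for i in range(n - 1):
        g.add_edge(i, i + 1, kind="temporal")

    # Semantic edges connect segments with high embedding similarity.
    sims = emb @ emb.T
    for i in range(n):
        for j in range(n):
            if i != j and sims[i, j] >= sim_threshold:
                g.add_edge(i, j, kind="semantic", weight=float(sims[i, j]))
    return g

# Toy usage: six random "segment embeddings" standing in for VLM outputs.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    graph = build_temporal_graph(rng.normal(size=(6, 128)))
    print(graph.number_of_nodes(), "nodes,", graph.number_of_edges(), "edges")
```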

📝 Abstract
The rapid growth of video content demands efficient and precise retrieval systems. While vision-language models (VLMs) excel in representation learning, they often struggle with adaptive, time-sensitive video retrieval. This paper introduces a novel framework that combines vector similarity search with graph-based data structures. By leveraging VLM embeddings for initial retrieval and modeling contextual relationships among video segments, our approach enables adaptive query refinement and improves retrieval accuracy. Experiments demonstrate its precision, scalability, and robustness, offering an effective solution for interactive video retrieval in dynamic environments.
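
As a hedged sketch of the two-stage idea in the abstract (VLM embeddings for initial retrieval, graph structure for contextual re-ranking), the snippet below scores segments by cosine similarity to a query embedding and then boosts candidates whose graph neighbors also score well. It assumes a graph with unit-normalized `embedding` node attributes, as in the earlier sketch; the neighbor-bonus scoring rule and its weight are illustrative assumptions rather than the paper's ranking model.

```python
# Minimal sketch of vector-similarity retrieval followed by a graph-based
# re-rank; the weights and scoring rule are illustrative assumptions.
import numpy as np

def retrieve(query_emb, graph, top_k=5, neighbor_weight=0.3):
    """Stage 1: cosine similarity between the query and node embeddings.
    Stage 2: add a bonus from the best-scoring graph neighbor, so segments
    embedded in a relevant temporal/semantic context rank higher."""
    q = np.asarray(query_emb, dtype=np.float32)
    q = q / np.linalg.norm(q)

    # Base scores from the query-segment embedding similarity.
    base = {node: float(q @ data["embedding"])
            for node, data in graph.nodes(data=True)}

    # Contextual re-rank using incoming and outgoing graph neighbors.
    reranked = {}
    for node in graph.nodes:
        neighbors = list(graph.successors(node)) + list(graph.predecessors(node))
        context = max((base[n] for n in neighbors), default=0.0)
        reranked[node] = base[node] + neighbor_weight * context

    return sorted(reranked.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```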
Problem

Research questions and friction points this paper is trying to address.

Improving adaptive video retrieval using VLMs
Combining vector similarity with graph structures
Enhancing accuracy in dynamic video environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines vector similarity search with graph structures
Leverages VLM embeddings for initial retrieval
Models contextual relationships among video segments
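
The paper's adaptive, multi-turn query refinement is not detailed on this page; one plausible reading is a relevance-feedback update that pulls the query embedding toward segments the user confirms as relevant before re-running retrieval. The Rocchio-style interpolation, its coefficients, and the hypothetical feedback hook below are assumptions, not the authors' method.

```python
# Minimal sketch of multi-turn query refinement via relevance feedback;
# the interpolation rule below is an assumption, not the paper's method.
import numpy as np

def refine_query(query_emb, relevant_embs, alpha=0.7, beta=0.3):
    """Pull the query toward the centroid of user-confirmed relevant
    segments, then re-normalize so cosine scoring still applies."""
    q = np.asarray(query_emb, dtype=np.float32)
    if len(relevant_embs) == 0:
        return q / np.linalg.norm(q)
    centroid = np.mean(np.asarray(relevant_embs, dtype=np.float32), axis=0)
    updated = alpha * q + beta * centroid
    return updated / np.linalg.norm(updated)

# Interactive loop (pseudo-usage, building on the earlier sketches):
# for turn in range(num_turns):
#     hits = retrieve(query_emb, graph)           # graph-based re-ranked results
#     relevant = collect_user_feedback(hits)      # hypothetical UI hook
#     query_emb = refine_query(
#         query_emb, [graph.nodes[i]["embedding"] for i in relevant])
```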
🔎 Similar Papers
2024-06-09 · Annual Meeting of the Association for Computational Linguistics · Citations: 13
Yicheng Duan
Case Western Reserve University
Embodied AI, CV
Xi Huang
Computer and Data Sciences, School of Engineering, Case Western Reserve University, Cleveland, Ohio, 44106, USA
Duo Chen
Computer and Data Sciences, School of Engineering, Case Western Reserve University, Cleveland, Ohio, 44106, USA