Enhancing Subsequent Video Retrieval via Vision-Language Models (VLMs)

📅 2025-03-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of temporal sensitivity and context-adaptive retrieval in dynamic video corpora, this paper proposes a Vision-Language Model (VLM)-driven, graph-enhanced retrieval framework. The method maps multi-granularity semantic embeddings generated by a VLM onto directed temporal graph nodes, explicitly modeling both temporal and semantic dependencies among video segments via the graph structure. Crucially, it supports context-aware, multi-turn query refinement through iterative re-ranking. Unlike conventional single-shot retrieval paradigms, the framework enables progressive intent refinement across interaction rounds. Evaluated on multiple benchmark datasets, it achieves a 12.6% improvement in mean Average Precision (mAP) and reduces response latency by 37%, significantly enhancing long-video segment localization accuracy and interactive robustness.
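
The paper's code is not reproduced on this page; the sketch below is only a minimal illustration of the graph construction the summary describes, mapping per-segment embeddings (stand-ins for VLM outputs) onto nodes of a directed temporal graph with edges for temporal adjacency and semantic similarity. The use of networkx/numpy, the similarity threshold, and all function names are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (assumption: networkx + numpy as stand-ins; the paper's actual
# graph construction, thresholds, and VLM encoder are not specified here).
import numpy as np
import networkx as nx

def build_temporal_graph(segment_embeddings, sim_threshold=0.75):
    """Map per-segment VLM embeddings onto a directed temporal graph.

    segment_embeddings: (num_segments, dim) array, ordered by time.
    Edges: (i -> i+1) for temporal adjacency, plus (i -> j) for pairs whose
    cosine similarity exceeds sim_threshold (a semantic dependency).
    """
    emb = np.asarray(segment_embeddings, dtype=np.float32)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize rows
    n = emb.shape[0]

    g = nx.DiGraph()
    for i in range(n):
        g.add_node(i, embedding=emb[i])

    # Temporal edges follow the playback order of the segments.
    for i in range(n - 1):
        g.add_edge(i, i + 1, kind="temporal")

    # Semantic edges connect segments with high embedding similarity.
    sims = emb @ emb.T
    for i in range(n):
        for j in range(n):
            if i != j and sims[i, j] >= sim_threshold:
                g.add_edge(i, j, kind="semantic", weight=float(sims[i, j]))
    return g

# Toy usage: six random "segment embeddings" standing in for VLM outputs.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    graph = build_temporal_graph(rng.normal(size=(6, 128)))
    print(graph.number_of_nodes(), "nodes,", graph.number_of_edges(), "edges")
```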

📝 Abstract
The rapid growth of video content demands efficient and precise retrieval systems. While vision-language models (VLMs) excel in representation learning, they often struggle with adaptive, time-sensitive video retrieval. This paper introduces a novel framework that combines vector similarity search with graph-based data structures. By leveraging VLM embeddings for initial retrieval and modeling contextual relationships among video segments, our approach enables adaptive query refinement and improves retrieval accuracy. Experiments demonstrate its precision, scalability, and robustness, offering an effective solution for interactive video retrieval in dynamic environments.
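
As a hedged sketch of the two-stage idea in the abstract (VLM embeddings for initial retrieval, graph structure for contextual re-ranking), the snippet below scores segments by cosine similarity to a query embedding and then boosts candidates whose graph neighbors also score well. It assumes a graph with unit-normalized `embedding` node attributes, as in the earlier sketch; the neighbor-bonus scoring rule and its weight are illustrative assumptions rather than the paper's ranking model.

```python
# Minimal sketch of vector-similarity retrieval followed by a graph-based
# re-rank; the weights and scoring rule are illustrative assumptions.
import numpy as np

def retrieve(query_emb, graph, top_k=5, neighbor_weight=0.3):
    """Stage 1: cosine similarity between the query and node embeddings.
    Stage 2: add a bonus from the best-scoring graph neighbor, so segments
    embedded in a relevant temporal/semantic context rank higher."""
    q = np.asarray(query_emb, dtype=np.float32)
    q = q / np.linalg.norm(q)

    # Base scores from the query-segment embedding similarity.
    base = {node: float(q @ data["embedding"])
            for node, data in graph.nodes(data=True)}

    # Contextual re-rank using incoming and outgoing graph neighbors.
    reranked = {}
    for node in graph.nodes:
        neighbors = list(graph.successors(node)) + list(graph.predecessors(node))
        context = max((base[n] for n in neighbors), default=0.0)
        reranked[node] = base[node] + neighbor_weight * context

    return sorted(reranked.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```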
Problem

Research questions and friction points this paper is trying to address.

Improving adaptive video retrieval using VLMs
Combining vector similarity with graph structures
Enhancing accuracy in dynamic video environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines vector similarity search with graph structures
Leverages VLM embeddings for initial retrieval
Models contextual relationships among video segments
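
The paper's adaptive, multi-turn query refinement is not detailed on this page; one plausible reading is a relevance-feedback update that pulls the query embedding toward segments the user confirms as relevant before re-running retrieval. The Rocchio-style interpolation, its coefficients, and the hypothetical feedback hook below are assumptions, not the authors' method.

```python
# Minimal sketch of multi-turn query refinement via relevance feedback;
# the interpolation rule below is an assumption, not the paper's method.
import numpy as np

def refine_query(query_emb, relevant_embs, alpha=0.7, beta=0.3):
    """Pull the query toward the centroid of user-confirmed relevant
    segments, then re-normalize so cosine scoring still applies."""
    q = np.asarray(query_emb, dtype=np.float32)
    if len(relevant_embs) == 0:
        return q / np.linalg.norm(q)
    centroid = np.mean(np.asarray(relevant_embs, dtype=np.float32), axis=0)
    updated = alpha * q + beta * centroid
    return updated / np.linalg.norm(updated)

# Interactive loop (pseudo-usage, building on the earlier sketches):
# for turn in range(num_turns):
#     hits = retrieve(query_emb, graph)           # graph-based re-ranked results
#     relevant = collect_user_feedback(hits)      # hypothetical UI hook
#     query_emb = refine_query(
#         query_emb, [graph.nodes[i]["embedding"] for i in relevant])
```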
🔎 Similar Papers
2024-06-09 · Annual Meeting of the Association for Computational Linguistics · Citations: 13
Yicheng Duan
Case Western Reserve University
Embodied AI, CV
Xi Huang
Computer and Data Sciences, School of Engineering, Case Western Reserve University, Cleveland, Ohio, 44106, USA
Duo Chen
Computer and Data Sciences, School of Engineering, Case Western Reserve University, Cleveland, Ohio, 44106, USA