Prompt-aware of Frame Sampling for Efficient Text-Video Retrieval

📅 2025-07-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the trade-off between accuracy and efficiency in text-to-video retrieval on edge devices, this paper proposes ProCLIP. The method introduces (1) a prompt-aware dynamic frame sampling mechanism that uses the textual query to guide keyframe selection, avoiding the query-agnostic bias of conventional uniform or heuristic sampling, and (2) a lightweight two-stage retrieval strategy: fast coarse retrieval using compact features, followed by fine-grained re-ranking with CLIP. On MSR-VTT, ProCLIP achieves R@1 = 49.0, which the authors describe as competitive with the state of the art, while reducing latency by 75.3% relative to baselines. The lower latency makes edge deployment more practical without a reported loss in retrieval accuracy.

📝 Abstract
Enabling efficient text-video retrieval on edge devices is critical for real-world applications. Yet existing methods struggle to balance accuracy and computational efficiency: uniform frame sampling methods ensure content coverage but incur prohibitive computational costs, while salient-frame sampling methods reduce overhead but suffer from query-agnostic frame selection that biases retrieval results. To address this, we propose ProCLIP, a user-centric framework that achieves state-of-the-art accuracy with significantly improved efficiency. We design a prompt-aware frame sampling strategy that dynamically guides lightweight feature extractors using textual prompts to select semantically relevant frames, overcoming the limitations of existing salient-frame sampling methods, which rely on static, query-agnostic selection criteria. Moreover, we adopt a two-stage candidate pruning strategy that combines rapid coarse filtering via a lightweight module with CLIP-powered fine-grained re-ranking, enhancing retrieval efficiency while preserving accuracy. Experiments across benchmarks show that ProCLIP achieves a 75.3% latency reduction versus baselines while maintaining competitive accuracy, with R@1 = 49.0 on the MSR-VTT dataset. Code is available at https://github.com/tiffylong/ProCLIP.
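The two-stage candidate pruning described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function `two_stage_retrieve`, the `compact` and `full` feature fields, the dot-product scoring, and the toy vectors are all assumptions made for illustration. The only idea taken from the abstract is that a cheap scorer shortlists candidates and a more expensive scorer (CLIP in the paper) re-ranks only that shortlist.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def two_stage_retrieve(query, videos, coarse_score, fine_score, n_candidates, top_k):
    """Shortlist with a cheap scorer, then re-rank the shortlist with a costly one."""
    # Stage 1: score every video with the cheap scorer and keep the best n_candidates.
    shortlist = sorted(videos, key=lambda v: coarse_score(query, v), reverse=True)[:n_candidates]
    # Stage 2: run the expensive scorer only on the shortlist.
    reranked = sorted(shortlist, key=lambda v: fine_score(query, v), reverse=True)
    return reranked[:top_k]


# Hypothetical scorers: "compact" stands in for lightweight features,
# "full" stands in for the features of the expensive model.
def coarse_score(query, video):
    return dot(query["compact"], video["compact"])


def fine_score(query, video):
    return dot(query["full"], video["full"])


# Toy data, invented for this example.
videos = [
    {"id": "a", "compact": [0.9, 0.1], "full": [0.2, 0.1, 0.1]},
    {"id": "b", "compact": [0.8, 0.3], "full": [0.9, 0.1, 0.0]},
    {"id": "c", "compact": [0.1, 0.9], "full": [1.0, 0.0, 0.0]},
]
query = {"compact": [1.0, 0.0], "full": [1.0, 0.0, 0.0]}

best = two_stage_retrieve(query, videos, coarse_score, fine_score, n_candidates=2, top_k=1)
print([v["id"] for v in best])  # prints ['b']
```

In this toy data, video "c" has the highest fine score but is dropped in stage 1, which shows the trade-off: a smaller `n_candidates` saves expensive scoring calls but can prune a video the fine scorer would have ranked first.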
Problem

Research questions and friction points this paper is trying to address.

Balancing accuracy and efficiency in text-video retrieval
Overcoming query-agnostic frame selection biases
Reducing latency while maintaining retrieval accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prompt-aware frame sampling for relevance
Two-stage pruning with coarse and fine filtering
Dynamic lightweight feature extraction guided by text
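As a rough illustration of prompt-aware frame sampling, the sketch below scores each frame against the text query and keeps the most relevant ones. It is a simplified stand-in, not ProCLIP's method: the use of cosine similarity, the top-k rule, the function names, and the toy embeddings are all assumptions. The abstract states only that textual prompts guide a lightweight feature extractor toward semantically relevant frames.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_frames(text_emb, frame_embs, k):
    """Return the indices of the k frames most similar to the query, in temporal order."""
    scores = [cosine(text_emb, f) for f in frame_embs]
    # Rank frame indices by relevance to the query and keep the top k.
    top = sorted(range(len(frame_embs)), key=lambda i: scores[i], reverse=True)[:k]
    # Restore temporal order so downstream models see frames in sequence.
    return sorted(top)


# Toy embeddings, invented for this example.
query_emb = [1.0, 0.0]
frame_embs = [[0.9, 0.1], [0.0, 1.0], [0.7, 0.7], [1.0, 0.0]]

print(select_frames(query_emb, frame_embs, k=2))  # prints [0, 3]
```

Unlike uniform sampling, the selected indices depend on the query: a different `query_emb` over the same frames would generally pick a different subset.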
Deyu Zhang
Central South University
Edge Intelligence, Embodied Intelligence
Tingting Long
Central South University
mobile computing and edge intelligence
Jinrui Zhang
Department of Computer Science and Technology, Tsinghua University, China
Ligeng Chen
Operation System Group, Honor Device Co., Ltd, China
Ju Ren
Department of Computer Science and Technology, Tsinghua University
Internet-of-Things, Edge Computing/Intelligence, Security and Privacy
Yaoxue Zhang
Department of Computer Science and Technology, Tsinghua University, China