MS-DETR: Towards Effective Video Moment Retrieval and Highlight Detection by Joint Motion-Semantic Learning

📅 2025-07-16
🤖 AI Summary
This work addresses two key challenges in video moment retrieval (MR) and highlight detection (HD): the entanglement of temporal motion and spatial semantic modeling, and performance degradation caused by sparse annotations. To this end, we propose Motion-Semantics DETR (MS-DETR), a framework built upon the DETR architecture that explicitly disentangles intra-modal motion-semantics correlations and introduces a cross-task correlation learning mechanism. Furthermore, we design generative data augmentation and contrastive denoising training to mitigate annotation sparsity. Leveraging a query-guided encoder-decoder structure, our method fuses multimodal features to enable precise, text-driven moment localization and refined highlight boundary delineation. Extensive experiments demonstrate state-of-the-art performance across four major benchmarks, with significant improvements in both localization accuracy and boundary consistency.

📝 Abstract
Video Moment Retrieval (MR) and Highlight Detection (HD) aim to pinpoint specific moments and assess clip-wise relevance based on a given text query. While DETR-based joint frameworks have made significant strides, there remains untapped potential in harnessing the intricate relationships between temporal motion and spatial semantics within video content. In this paper, we propose the Motion-Semantics DETR (MS-DETR), a framework that captures rich motion-semantics features through unified learning for MR/HD tasks. The encoder first explicitly models disentangled intra-modal correlations within the motion and semantics dimensions, guided by the given text queries. Subsequently, the decoder utilizes the task-wise correlation across temporal motion and spatial semantics dimensions to enable precise query-guided localization for MR and refined highlight boundary delineation for HD. Furthermore, we observe an inherent sparsity dilemma within the motion and semantics dimensions of MR/HD datasets. To address this issue, we enrich the corpus along both dimensions via generation strategies and propose contrastive denoising learning to ensure the above components learn robustly and effectively. Extensive experiments on four MR/HD benchmarks demonstrate that our method outperforms existing state-of-the-art models by a clear margin. Our code is available at https://github.com/snailma0229/MS-DETR.git.
Problem

Research questions and friction points this paper is trying to address.

Enhancing video moment retrieval and highlight detection via motion-semantic learning
Addressing sparsity in motion and semantics dimensions of MR/HD datasets
Improving query-guided localization and highlight boundary delineation in videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint motion-semantic learning for video retrieval
Disentangled intra-modal correlations in encoder
Contrastive denoising learning for robustness