GIRL-DETR: Gradient-Isolated Reinforcement Learning for Video Moment Retrieval

📅 2026-05-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the optimization mismatch between continuous surrogate losses and non-differentiable evaluation metrics, such as temporal Intersection over Union (tIoU), in video moment retrieval. The authors propose GIRL-DETR, which they describe as the first method to introduce gradient-isolated reinforcement learning post-training into a lightweight temporal localization framework. A text-guided gating mechanism injects semantic priors into the decoder queries. Once supervised training has converged, the backbone network is frozen to preserve the feature manifold, and the detection head directly optimizes tIoU through a three-stage progressive reinforcement learning scheme. This design decouples state representation from metric optimization. Evaluated on Charades-STA, QVHighlights, and TACoS, the method is reported to improve localization accuracy while updating only a small number of parameters, mitigating the suboptimal convergence commonly observed in the late stages of lightweight model training.
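For readers unfamiliar with the metric, the following is a minimal sketch of how tIoU between a predicted and a ground-truth segment is commonly computed. The function names `temporal_iou` and `hit_at`, and the 0.5 threshold, are illustrative assumptions and are not taken from the paper.

```python
def temporal_iou(pred, gt):
    """tIoU between two (start, end) segments given in seconds."""
    ps, pe = pred
    gs, ge = gt
    # Length of the overlap; zero when the segments are disjoint.
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    # Union = sum of both lengths minus the overlap counted twice.
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def hit_at(pred, gt, threshold=0.5):
    """Thresholded hit, as used in recall-style metrics (assumed threshold)."""
    # This is a step function of the boundaries, so it carries no useful gradient.
    return temporal_iou(pred, gt) >= threshold


# Overlap is 2 s and union is 6 s, so tIoU = 1/3 and the 0.5 threshold is missed.
score = temporal_iou((2.0, 6.0), (4.0, 8.0))
hit = hit_at((2.0, 6.0), (4.0, 8.0))
```

The thresholded variant illustrates why a smooth surrogate loss can keep decreasing without changing the reported metric: small boundary shifts leave the hit/miss outcome unchanged.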

📝 Abstract

The Video Moment Retrieval (VMR) task requires accurately localizing temporal boundaries aligned with natural language queries. Many models, however, suffer from a misalignment between continuous surrogate losses and non-differentiable metrics, which causes optimization to stagnate in the late stages of training and traps boundary predictions in suboptimal solutions. Although Reinforcement Learning (RL) post-training successfully optimizes localization results for large models, applying it directly to lightweight networks easily disrupts the fragile feature representations established during the supervised phase. To overcome this optimization bottleneck, we propose Gradient-Isolated Reinforcement Learning for DETR (GIRL-DETR), introducing RL post-training into a lightweight temporal localization framework for the first time. The input video and text features first establish early alignment through Cross-Modal Interaction (CMI) before entering the transformer encoder. Subsequently, a Text-Guided Gating (TGG) mechanism dynamically injects semantic priors into the queries before the transformer decoder generates candidate proposals, providing high signal-to-noise ratio inputs for temporal prediction. After supervised training reaches convergence, the backbone network is frozen to protect the feature manifold, while the detection head directly optimizes the non-differentiable evaluation metric tIoU through a Three-stage Progressive Reinforcement Learning (TPRL) strategy to enhance localization accuracy. This approach achieves an orthogonal decoupling of state representation and metric optimization. Experiments on Charades-STA, QVHighlights, and TACoS demonstrate that GIRL-DETR effectively resolves surrogate loss degradation and achieves substantial accuracy improvements with minimal parameter updates, providing a robust new pathway for RL applications in lightweight VMR models.
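The idea of freezing the backbone while a head optimizes tIoU as a reward can be illustrated with a toy example. The sketch below uses a plain REINFORCE update with a mean-reward baseline on a Gaussian policy over segment center and width; it is not the paper's TPRL procedure, whose stages are not described here. All names (`frozen_backbone`, `reinforce_step`, the `head` dictionary) and all hyperparameters are hypothetical.

```python
import random


def temporal_iou(pred, gt):
    """tIoU between two (start, end) segments."""
    ps, pe = pred
    gs, ge = gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def frozen_backbone(feature, weights):
    """Stand-in for a frozen backbone: a fixed linear map, never updated."""
    return sum(w * x for w, x in zip(weights, feature))


def reinforce_step(head, state, gt, rng, sigma=0.5, lr=0.05, n_samples=16):
    """One REINFORCE update of the head only (assumed toy policy)."""
    mean_c = state + head["center"]  # policy mean for the segment center
    mean_w = head["width"]           # policy mean for the segment width
    draws = []
    for _ in range(n_samples):
        c = rng.gauss(mean_c, sigma)
        w = rng.gauss(mean_w, sigma)
        half = max(w, 1e-3) / 2.0    # keep the sampled segment valid
        reward = temporal_iou((c - half, c + half), gt)  # tIoU is the reward
        draws.append((c, w, reward))
    baseline = sum(r for _, _, r in draws) / n_samples   # variance reduction
    # Score-function gradient of a Gaussian w.r.t. its mean: (x - mean) / sigma^2.
    scale = n_samples * sigma ** 2
    grad_c = sum((r - baseline) * (c - mean_c) for c, _, r in draws) / scale
    grad_w = sum((r - baseline) * (w - mean_w) for _, w, r in draws) / scale
    # Gradient ascent on expected reward; a new head is returned, the state is untouched.
    return {"center": head["center"] + lr * grad_c,
            "width": head["width"] + lr * grad_w}


rng = random.Random(0)
backbone_w = (2.0, 1.0)                           # frozen: stored as an immutable tuple
state = frozen_backbone((1.0, 2.0), backbone_w)   # computed once, no gradient flows back
head = {"center": 0.0, "width": 2.0}              # the only trainable parameters
for _ in range(200):
    head = reinforce_step(head, state, gt=(4.0, 8.0), rng=rng)
```

Because the reward enters only through sampled segments, the update needs no gradient of tIoU itself, and because the backbone output is treated as a constant, the learned representation cannot drift during this phase.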

Problem

Research questions and friction points this paper is trying to address.

Video Moment Retrieval

Reinforcement Learning

Optimization Bottleneck

Non-differentiable Metric

Lightweight Models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Gradient-Isolated Reinforcement Learning

Video Moment Retrieval

Three-stage Progressive RL