Knowledge-Preserved Model Tuning in Null-Space for Robust Spatio-Temporal Video Grounding

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two problems: existing video grounding methods degrade sharply on low-quality videos, and fine-tuning them risks catastrophic forgetting. The authors propose Null-Space Tuning (NST), which they present as the first framework to exploit the geometric properties of the null-space in video grounding. NST injects learnable residuals into the input features of a frozen pretrained backbone, combining a Quality-Adaptive Unit with Dual-Space Reparameterization so that residual components for high-quality inputs stay in the null-space of the frozen weights, while restoration components for low-quality inputs fall in the non-null space. Fine-tuning thus adapts to input quality while preserving pretrained knowledge. On a newly constructed Mixed-Quality benchmark, NST outperforms current state-of-the-art methods, improving localization accuracy on degraded videos without compromising performance on high-quality inputs.

📝 Abstract

Spatio-Temporal Video Grounding aims to localize object tubes based on textual queries. While recent methods have achieved remarkable success, they mainly focus on high-quality (HQ) inputs, neglecting the widespread presence of low-quality (LQ) videos in real-world scenarios. Although tuning methods like LoRA can adapt to degraded inputs, they inevitably disrupt pre-trained knowledge. To address this, we propose Null-Space Tuning (NST). This framework exploits the geometric property that adding vectors within the null-space of frozen weights to the layer input does not affect the output. Leveraging this, NST injects learnable residuals into input features that can be selectively invisible to the pre-trained backbone. Specifically, NST combines the Quality-Adaptive Unit and Dual-Space Reparameterization to synthesize these residuals by confining components for HQ inputs to the null-space, while directing restoration components for LQ inputs to the non-null space. As the frozen weights eliminate null-space components, we effectively rectify degraded inputs while preserving pre-trained knowledge for HQ inputs. Extensive experiments show that NST outperforms state-of-the-art methods on our Mixed-Quality benchmark.
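
The geometric property the abstract relies on can be checked numerically. The sketch below is not the authors' implementation: the weight shape, the random residual, and the helper name null_space_projector are all assumptions made for illustration, and the Quality-Adaptive Unit and Dual-Space Reparameterization are not modeled. It only shows that a residual component in the null-space of a frozen weight matrix leaves the layer output unchanged, while the complementary (non-null) component changes it.

```python
import numpy as np

def null_space_projector(W, rtol=1e-10):
    """Hypothetical helper: orthogonal projector onto the null-space of W."""
    # Full SVD: the rows of Vt beyond the numerical rank span the null-space.
    _, s, Vt = np.linalg.svd(W, full_matrices=True)
    rank = int(np.sum(s > rtol * s.max()))
    N = Vt[rank:].T              # shape (d_in, d_in - rank)
    return N @ N.T               # shape (d_in, d_in)

rng = np.random.default_rng(0)

# Assumed toy layer: d_out=4 < d_in=8, so the null-space is nontrivial.
W = rng.standard_normal((4, 8))  # stands in for a frozen pretrained weight
x = rng.standard_normal(8)       # stands in for an input feature
r = rng.standard_normal(8)       # stands in for a learnable residual

P_null = null_space_projector(W)
P_row = np.eye(8) - P_null       # projector onto the non-null (row) space

r_null = P_null @ r              # component the frozen layer cannot see
r_row = P_row @ r                # component the frozen layer responds to

out_clean = W @ x
out_null = W @ (x + r_null)
out_row = W @ (x + r_row)

null_is_invisible = np.allclose(out_clean, out_null)
row_changes_output = not np.allclose(out_clean, out_row)
print(null_is_invisible, row_changes_output)
```

Under these assumptions, the split r = r_null + r_row mirrors the two roles the abstract describes: the null-space part is eliminated by the frozen weights, and the non-null part is the only route by which a residual can alter what the backbone computes.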

Problem

Research questions and friction points this paper is trying to address.

Spatio-Temporal Video Grounding

Low-Quality Videos

Pre-trained Knowledge Preservation

Model Tuning

Video Grounding

Innovation

Methods, ideas, or system contributions that make the work stand out.

Null-Space Tuning

Knowledge Preservation

Spatio-Temporal Video Grounding