Distill Video Datasets into Images

📅 2025-12-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Video dataset distillation faces bottlenecks including parameter explosion and poor generalization, primarily induced by the temporal dimension. To address this, the authors propose Single-Frame Video set Distillation (SFVD), a framework that encodes full video semantics into a single highly discriminative image frame while freezing the parameters of the pre-trained video model and optimizing only the frame content. SFVD uses differentiable interpolation to reconstruct pseudo-video sequences from the frames and incorporates a channel-reshaping layer to integrate authentic temporal information from real videos. It further employs a feature-matching-based distillation objective coupled with gradient-constrained optimization. The work identifies the parameter redundancy induced by temporal modeling as the core obstacle in video distillation and circumvents it, establishing a "single-frame representation of video" paradigm. On benchmarks including MiniUCF, SFVD consistently outperforms state-of-the-art methods, with accuracy gains of up to 5.3%, while yielding a smaller dataset size, higher training efficiency, and superior cross-domain generalization.

📝 Abstract

Dataset distillation aims to synthesize compact yet informative datasets that allow models trained on them to achieve performance comparable to training on the full dataset. While this approach has shown promising results for image data, extending dataset distillation methods to video data has proven challenging and often leads to suboptimal performance. In this work, we first identify the core challenge in video set distillation as the substantial increase in learnable parameters introduced by the temporal dimension of video, which complicates optimization and hinders convergence. To address this issue, we observe that a single frame is often sufficient to capture the discriminative semantics of a video. Leveraging this insight, we propose Single-Frame Video set Distillation (SFVD), a framework that distills videos into highly informative frames for each class. Using differentiable interpolation, these frames are transformed into video sequences and matched with the original dataset, while updates are restricted to the frames themselves for improved optimization efficiency. To further incorporate temporal information, the distilled frames are combined with sampled real videos during the matching process through a channel reshaping layer. Extensive experiments on multiple benchmarks demonstrate that SFVD substantially outperforms prior methods, achieving improvements of up to 5.3% on MiniUCF and thereby offering a more effective solution for video set distillation.
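
The idea of updating only the frame while matching in video space can be illustrated with a toy sketch. This is not the authors' implementation: it assumes the simplest form of temporal interpolation (repeating the single frame along the time axis), replaces the feature extractor with the identity, and uses a plain squared-error loss. The names `frame_to_video`, `matching_loss`, `grad_wrt_frame`, and `sgd_step` are hypothetical.

```python
def frame_to_video(frame, num_frames):
    """Expand one frame (a flat list of pixels) into a pseudo-video.
    Assumption: temporal nearest-neighbour interpolation, which for a
    single source frame reduces to repetition along the time axis."""
    return [list(frame) for _ in range(num_frames)]


def matching_loss(frame, real_clip):
    """Squared error between the pseudo-video and a real clip.
    Stand-in for feature matching with an identity feature extractor."""
    pseudo = frame_to_video(frame, len(real_clip))
    return sum((p - r) ** 2
               for pseudo_frame, real_frame in zip(pseudo, real_clip)
               for p, r in zip(pseudo_frame, real_frame))


def grad_wrt_frame(frame, real_clip):
    """Analytic gradient of matching_loss with respect to the frame.
    Each pixel appears in every time step, so its gradient is the sum
    of the per-time-step gradients."""
    return [sum(2.0 * (x - real_frame[i]) for real_frame in real_clip)
            for i, x in enumerate(frame)]


def sgd_step(frame, real_clip, lr=0.1):
    """One gradient step that updates the frame only."""
    grad = grad_wrt_frame(frame, real_clip)
    return [x - lr * g for x, g in zip(frame, grad)]


# Toy data: a real clip with 3 frames of 2 pixels each.
real_clip = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
frame = [0.0, 0.0]
initial_loss = matching_loss(frame, real_clip)

for _ in range(50):
    frame = sgd_step(frame, real_clip)

final_loss = matching_loss(frame, real_clip)
print(initial_loss, final_loss, frame)
```

With this toy loss the frame converges to the per-pixel temporal mean of the clip. The point of the sketch is only that the loss is computed on a full-length sequence while the optimized variable stays the size of one image.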

Problem

Research questions and friction points this paper is trying to address.

Distill video datasets into compact informative images

Address increased parameters from video temporal dimension

Improve optimization efficiency by focusing on key frames
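
To make the parameter growth concrete, the short calculation below compares the number of learnable pixel values in one synthetic clip with that of one synthetic frame. The clip shape (16 frames of 3 x 112 x 112 pixels) is an assumption chosen for illustration and is not taken from the paper.

```python
def learnable_params(frames: int, channels: int, height: int, width: int) -> int:
    """Number of pixel values optimized for one synthetic sample."""
    return frames * channels * height * width


# Assumed, illustrative clip shape: 16 frames of 3 x 112 x 112 pixels.
T, C, H, W = 16, 3, 112, 112

video_params = learnable_params(T, C, H, W)  # distilling a full clip
frame_params = learnable_params(1, C, H, W)  # distilling a single frame

print(video_params, frame_params, video_params // frame_params)
# prints: 602112 37632 16
```

Under these assumed dimensions, the search space per synthetic sample shrinks by a factor equal to the clip length.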

Innovation

Methods, ideas, or system contributions that make the work stand out.

Distills videos into single informative frames per class

Uses differentiable interpolation to create video sequences

Combines distilled frames with real videos via channel reshaping
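
The summary does not specify how the channel reshaping layer works. The sketch below shows one possible reading only, as an assumption: the time axis of a sampled real clip is folded into the channel axis, turning shape (T, C, H, W) into (T*C, H, W), and the result is concatenated with the distilled frame along channels. The helpers `fold_time_into_channels`, `combine`, and `shape` are hypothetical names, and nested Python lists stand in for tensors.

```python
def fold_time_into_channels(clip):
    """(T, C, H, W) nested lists -> (T*C, H, W).
    Stacks the channels of every frame in temporal order."""
    return [channel for frame in clip for channel in frame]


def combine(distilled_frame, real_clip):
    """Concatenate a (C, H, W) frame with a folded real clip along the
    channel axis. Assumed combination rule, not confirmed by the source."""
    return list(distilled_frame) + fold_time_into_channels(real_clip)


def shape(x):
    """Shape of a regular nested list, for inspection only."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        x = x[0]
    return tuple(dims)


# Toy data: a real clip with T=4 frames, C=3 channels, 2 x 2 pixels.
# Every pixel of frame t holds the value float(t).
real_clip = [[[[float(t)] * 2 for _ in range(2)] for _ in range(3)]
             for t in range(4)]
# A distilled frame with C=3 channels, 2 x 2 pixels.
distilled_frame = [[[0.5] * 2 for _ in range(2)] for _ in range(3)]

combined = combine(distilled_frame, real_clip)
print(shape(real_clip), shape(distilled_frame), shape(combined))
# prints: (4, 3, 2, 2) (3, 2, 2) (15, 2, 2)
```

The combined tensor keeps the image layout (channels, height, width), so temporal content from real videos is carried in extra channels rather than in an extra learnable time axis.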

🔎 Similar Papers

VideoPrism: A Foundational Visual Encoder for Video Understanding