AI Summary
This work addresses the challenges of high storage overhead and privacy risks associated with large-scale datasets in remote sensing image interpretation by introducing dataset distillation to this domain for the first time. The authors propose a discriminative prototype-guided diffusion distillation framework that leverages text-to-image diffusion models, classifier guidance, and semantic prototype selection to generate high-quality, diverse synthetic samples in the latent space. By integrating vision-language models with textual description aggregation techniques, the method enhances both the realism and discriminability of the synthesized images. Experimental results on three high-resolution remote sensing scene classification benchmarks demonstrate that the distilled dataset effectively supports downstream task training while significantly reducing reliance on original data.
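The classifier-guided objective described above can be sketched as a combined loss: the standard denoising error plus a classification-consistency term from a frozen pre-trained classifier. The exact formulation and the weighting `lam` are not given in the summary, so this is an illustrative, forward-pass-only sketch (a real implementation would compute both terms with a differentiable framework and backpropagate):

```python
import numpy as np

# Illustrative sketch only: `lam` and the loss shape are assumptions, not the
# paper's exact formulation. In practice both terms would be differentiable
# (e.g. in PyTorch) so the consistency term can guide diffusion training.

def guided_diffusion_loss(pred_noise, true_noise, logits, labels, lam=0.1):
    """Denoising MSE plus a classification-consistency term, where `logits`
    come from a frozen pre-trained classifier applied to the decoded
    estimate of the clean image."""
    # Standard diffusion objective: predict the injected noise
    loss_diff = np.mean((pred_noise - true_noise) ** 2)
    # Cross-entropy of classifier logits against ground-truth scene labels
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    loss_cls = -np.mean(log_probs[np.arange(len(labels)), labels])
    return loss_diff + lam * loss_cls
```

Synthetic samples whose decoded images the classifier labels incorrectly incur a larger penalty, which pushes the generator toward class-discriminative outputs.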
Abstract
Recent years have witnessed the remarkable success of deep learning in remote sensing image interpretation, driven by the availability of large-scale benchmark datasets. However, this reliance on massive training data brings two major challenges: (1) high storage and computational costs, and (2) the risk of data leakage, especially when sensitive categories are involved. To address these challenges, this study introduces the concept of dataset distillation into the field of remote sensing image interpretation for the first time. Specifically, we train a text-to-image diffusion model to condense a large-scale remote sensing dataset into a compact and representative distilled dataset. To improve the discriminative quality of the synthesized samples, we propose classifier-driven guidance, injecting a classification consistency loss from a pre-trained model into the diffusion training process. In addition, given the rich semantic complexity of remote sensing imagery, we perform latent-space clustering on the training samples to select representative and diverse prototypes as visual style guidance, while a vision-language model provides aggregated text descriptions. Experiments on three high-resolution remote sensing scene classification benchmarks show that the proposed method distills realistic and diverse samples for downstream model training. Code and pre-trained models are available at https://github.com/YonghaoXu/DPD.
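The prototype-selection step can be illustrated with a minimal sketch: cluster latent features of the training samples with plain k-means, then pick the sample nearest each centroid as a representative prototype. The paper's exact clustering algorithm, feature space, and number of prototypes are not specified in the abstract, so everything below is an assumption for illustration:

```python
import numpy as np

def select_prototypes(latents: np.ndarray, k: int, iters: int = 50, seed: int = 0):
    """Cluster latent features with plain k-means and return, for each of the
    k clusters, the index of the training sample closest to its centroid.
    A hypothetical sketch; the paper's actual clustering setup may differ."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from randomly chosen samples
    centroids = latents[rng.choice(len(latents), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign each sample to its nearest centroid
        dists = np.linalg.norm(latents[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Recompute centroids (keep the old one if a cluster empties)
        for j in range(k):
            members = latents[assign == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    # Prototype = real sample nearest each final centroid
    dists = np.linalg.norm(latents[:, None, :] - centroids[None, :, :], axis=2)
    return [int(dists[:, j].argmin()) for j in range(k)]
```

Because each prototype is an actual training sample rather than a centroid mean, it can serve directly as visual style guidance for the diffusion model while the spread of clusters preserves intra-class diversity.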