🤖 AI Summary
Existing dataset condensation methods struggle to preserve both the geometric structure and the fidelity of the data distribution that effective diffusion model training requires. This work proposes a geometry-aware subset selection approach that formulates real subset selection as a distribution alignment problem. It introduces one-sided partial optimal transport to this problem for the first time and combines it with feature-statistics regularization, semantic consistency regularization, and a two-stage discrete optimization strategy, retaining critical geometric and semantic information while significantly reducing dataset size. Extensive experiments show that the proposed method consistently achieves superior generation fidelity and distribution coverage across diffusion models, subset ratios, image resolutions, and training configurations.
📝 Abstract
Dataset condensation aims to construct compact datasets from real data via synthesis or selection. However, existing approaches are ill-suited for diffusion model training: synthetic data generation often yields low-fidelity samples unsuitable for authentic modeling, while real subset selection typically fails to preserve the distributional geometry required by diffusion likelihood objectives. To address this, we reformulate real subset selection as a geometry-aware distribution alignment problem. By incorporating one-sided partial optimal transport, our method selectively aligns a compact subset with the full data distribution while allowing unmatched mass in low-density regions, thereby preserving the geometric structure necessary for effective diffusion model training. To further ensure distributional fidelity, we complement geometric alignment with lightweight feature-statistics and semantic consistency regularization. We propose an efficient two-stage discrete optimization strategy to optimize this alignment objective. Extensive experiments across diffusion variants, subset sizes, image resolutions, and training rounds show that our method achieves superior fidelity and distributional coverage in diffusion model training. Code is available at https://github.com/2018cx/GADC.
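To make the idea of "allowing unmatched mass in low-density regions" concrete, the following is a minimal sketch and not the authors' implementation. It assumes a simplified setting in which the subset-side marginal is left unconstrained, so the partial transport cost reduces to a trimmed mean of nearest-neighbor distances: each data point is matched to its closest selected point, and only the best-covered fraction of points is counted. The function names, the `keep_frac` parameter, the toy data, and the single-stage greedy search are all illustrative assumptions; the feature-statistics and semantic regularizers and the two-stage optimization described above are not modeled.

```python
import numpy as np


def trimmed_cover_cost(features, selected, keep_frac=0.9):
    """Trimmed nearest-neighbor cost (hypothetical helper, not from the paper).

    Each point is matched to its nearest selected point; only the
    keep_frac fraction of points with the smallest squared distance is
    averaged, so the worst-covered points are left unmatched.
    """
    diffs = features[:, None, :] - features[selected][None, :, :]
    nearest_sq = (diffs ** 2).sum(axis=-1).min(axis=1)
    k = max(1, int(np.ceil(keep_frac * len(features))))
    return np.sort(nearest_sq)[:k].mean()


def greedy_select(features, budget, keep_frac=0.9):
    """Greedily add the point that most reduces the trimmed cost.

    This is a plain single-stage greedy search used for illustration;
    it stands in for, and does not reproduce, the two-stage strategy.
    """
    selected = []
    for _ in range(budget):
        best_idx, best_cost = None, np.inf
        for i in range(len(features)):
            if i in selected:
                continue
            cost = trimmed_cover_cost(features, selected + [i], keep_frac)
            if cost < best_cost:
                best_idx, best_cost = i, cost
        selected.append(best_idx)
    return selected


# Toy data (assumed): two tight clusters and one isolated outlier.
offsets = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [-0.1, 0.0], [0.0, -0.1]])
cluster_a = offsets                      # indices 0-4, around (0, 0)
cluster_b = 5.0 - offsets                # indices 5-9, around (5, 5)
outlier = np.array([[100.0, 100.0]])     # index 10, low-density region
points = np.vstack([cluster_a, cluster_b, outlier])

chosen = greedy_select(points, budget=2, keep_frac=0.9)
print(chosen)
```

With a budget of two, the sketch picks one point from each cluster and leaves the outlier unmatched, because trimming removes the outlier's distance from the objective. Setting `keep_frac=1.0` would recover an untrimmed coverage cost in which the outlier dominates the selection.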