🤖 AI Summary
To address the inflexibility and inefficiency of inner-loop optimization in dataset distillation, where conventional methods rely on fixed random truncation, this paper proposes Automatic Truncated Backpropagation Through Time (AT-BPTT). Its core contributions are threefold: (1) a stage-aware probabilistic timestep-selection mechanism that dynamically prioritizes informative timesteps; (2) an adaptive truncation-window sizing strategy guided by gradient variation, improving convergence stability and speed; and (3) a low-rank Hessian approximation that reduces the computational overhead of truncated backpropagation. Evaluated across multiple image benchmarks, AT-BPTT improves distilled-model accuracy by an average of 6.16%, accelerates inner-loop optimization by 3.9×, and reduces memory cost by 63%, establishing new state-of-the-art performance.
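The first two components can be illustrated with a minimal sketch. The paper does not publish this pseudocode; the function names, stage boundaries, and variance thresholds below are hypothetical, chosen only to show the idea: bias the truncation start toward a stage-preferred region of the unrolled trajectory, and grow or shrink the truncation window depending on how much recent gradient norms fluctuate.

```python
import random
import statistics

def stage_probabilities(step, total_steps):
    # Hypothetical stage-aware weighting over thirds of the trajectory:
    # early training favours early timesteps, late training favours late ones.
    frac = step / total_steps
    if frac < 1 / 3:      # early stage
        return [0.6, 0.3, 0.1]
    elif frac < 2 / 3:    # middle stage
        return [0.2, 0.6, 0.2]
    else:                 # late stage
        return [0.1, 0.3, 0.6]

def select_truncation_start(step, total_steps, n_timesteps, rng):
    # Sample which third of the unrolled trajectory to truncate in,
    # then pick a uniform start index inside that third.
    probs = stage_probabilities(step, total_steps)
    third = rng.choices([0, 1, 2], weights=probs)[0]
    lo = third * n_timesteps // 3
    hi = (third + 1) * n_timesteps // 3 - 1
    return rng.randint(lo, hi)

def adapt_window(window, grad_norm_history, min_w=2, max_w=16):
    # Grow the window when recent gradient norms are stable (low variance),
    # shrink it when they fluctuate strongly; thresholds are illustrative.
    if len(grad_norm_history) < 4:
        return window
    var = statistics.pvariance(grad_norm_history[-4:])
    if var < 0.01:
        return min(window + 1, max_w)
    if var > 0.1:
        return max(window - 1, min_w)
    return window
```

In an inner loop, one would call `select_truncation_start` once per unroll and `adapt_window` after logging each gradient norm, so the truncation position and window size both track the current training stage rather than being fixed at random.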
📝 Abstract
The growing demand for efficient deep learning has positioned dataset distillation as a pivotal technique for compressing training datasets while preserving model performance. However, existing inner-loop optimization methods for dataset distillation typically rely on random truncation strategies, which lack flexibility and often yield suboptimal results. In this work, we observe that neural networks exhibit distinct learning dynamics across different training stages (early, middle, and late), making random truncation ineffective. To address this limitation, we propose Automatic Truncated Backpropagation Through Time (AT-BPTT), a novel framework that dynamically adapts both truncation positions and window sizes according to intrinsic gradient behavior. AT-BPTT introduces three key components: (1) a probabilistic mechanism for stage-aware timestep selection, (2) an adaptive window sizing strategy based on gradient variation, and (3) a low-rank Hessian approximation to reduce computational overhead. Extensive experiments on CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet-1K show that AT-BPTT achieves state-of-the-art performance, improving accuracy by an average of 6.16% over baseline methods. Moreover, our approach accelerates inner-loop optimization by 3.9× while reducing memory cost by 63%.
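The third component, low-rank Hessian approximation, can also be sketched in a toy form. The paper's exact factorization is not given here; the sketch below, under the common rank-1 assumption, extracts the dominant eigenpair of a small symmetric Hessian by power iteration and uses it to approximate Hessian-vector products, which is how such approximations typically cut the cost of second-order terms in truncated unrolls.

```python
def matvec(H, v):
    # Dense matrix-vector product for a small symmetric Hessian H.
    return [sum(H[i][j] * v[j] for j in range(len(v))) for i in range(len(H))]

def top_eigenpair(H, iters=100):
    # Power iteration: repeatedly apply H and renormalize to find the
    # dominant eigenvector, then recover its eigenvalue via a Rayleigh quotient.
    v = [1.0] * len(H)
    for _ in range(iters):
        w = matvec(H, v)
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    lam = sum(v[i] * matvec(H, v)[i] for i in range(len(v)))
    return lam, v

def lowrank_hvp(lam, v, g):
    # Rank-1 approximation of a Hessian-vector product: H g ~= lam * (v.g) * v.
    # Storing only (lam, v) avoids materializing the full Hessian.
    coef = lam * sum(vi * gi for vi, gi in zip(v, g))
    return [coef * vi for vi in v]
```

For H = diag(2, 1), the dominant eigenpair is (2, [1, 0]), so the rank-1 product with g = [1, 1] keeps the dominant curvature direction and drops the rest; in practice the rank would be chosen to trade accuracy against the memory and compute savings reported for AT-BPTT.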