DNAD: Differentiable Neural Architecture Distillation

📅 2025-04-25
📈 Citations: 1
Influential: 0
🤖 AI Summary
Addressing the challenge of balancing accuracy and computational cost in efficient neural network design, this paper proposes Differentiable Neural Architecture Distillation (DNAD), which jointly optimizes accuracy, parameter count, and FLOPs within a cell space where cells of the same type need not share a topology. Methodologically, DNAD introduces a progressive shrinking mechanism for super-networks that enables controllable compression of the architecture search space, and integrates knowledge distillation into the differentiable search process to mitigate the over-fitting caused by one-level optimization in DARTS. Pareto-front optimization is employed to automatically generate a set of high-performance architectures with balanced multi-objective trade-offs. On ImageNet, the best-performing model achieves a 23.7% top-1 error rate with only 6.0M parameters and 598M FLOPs, outperforming most DARTS-based methods. Moreover, DNAD yields diverse architectures with consistently lower error rates, fewer parameters, and reduced computational cost on both CIFAR-10 and ImageNet.

📝 Abstract
To meet the demand for designing efficient neural networks with appropriate trade-offs between model performance (e.g., classification accuracy) and computational complexity, the differentiable neural architecture distillation (DNAD) algorithm is developed, built on two core ideas: search by deleting and search by imitating. First, to derive neural architectures in a space where cells of the same type no longer share the same topology, the super-network progressive shrinking (SNPS) algorithm is developed within the framework of differentiable architecture search (DARTS), i.e., search by deleting. Unlike conventional DARTS-based approaches, which yield neural architectures with simple structures and derive only one architecture per search procedure, SNPS derives a Pareto-optimal set of architectures with flexible structures by forcing the dynamic super-network to shrink progressively from a dense structure to a sparse one. Furthermore, since knowledge distillation (KD) has shown great effectiveness in training a compact network with the assistance of an over-parameterized model, we integrate SNPS with KD to formulate the DNAD algorithm, i.e., search by imitating. By minimizing behavioral differences between the super-network and a teacher network, the over-fitting of one-level DARTS is avoided and well-performing neural architectures are derived. Experiments on CIFAR-10 and ImageNet classification tasks demonstrate that both SNPS and DNAD derive sets of architectures that achieve similar or lower error rates with fewer parameters and FLOPs. In particular, DNAD achieves a top-1 error rate of 23.7% on ImageNet classification with a model of 6.0M parameters and 598M FLOPs, which outperforms most DARTS-based methods.
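The "search by deleting" idea can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`mixed_op_weights`, `progressive_shrink`) and the toy architecture weights are hypothetical. A DARTS-style edge mixes candidate operations with softmax weights, and an SNPS-style shrinking step masks out the weakest candidates so the dense super-network progressively becomes sparse.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mixed_op_weights(alpha):
    """DARTS-style continuous relaxation: an edge's output is a
    softmax-weighted mixture of its candidate operations."""
    return softmax(alpha)

def progressive_shrink(alpha, keep):
    """Sketch of one 'search by deleting' step: mask the weakest
    candidate ops on an edge, keeping only `keep` of them, so the
    dense super-network shrinks toward a sparse architecture."""
    order = np.argsort(alpha)                 # ascending: weakest first
    pruned = alpha.astype(float).copy()
    pruned[order[:len(alpha) - keep]] = -np.inf  # deleted ops get zero weight
    return pruned

# Toy edge with 5 candidate operations.
alpha = np.array([0.1, 2.0, -1.0, 0.5, 1.5])
alpha = progressive_shrink(alpha, keep=3)   # one shrinking step
w = mixed_op_weights(alpha)
print(np.flatnonzero(w > 0))                # indices of surviving ops
```

Repeating the shrinking step over training epochs (with `keep` decreasing) is what lets the search emit a sequence of progressively sparser architectures, i.e., the Pareto-optimal set mentioned in the abstract.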
Problem

Research questions and friction points this paper is trying to address.

Design efficient neural networks balancing performance and complexity
Derive diverse Pareto-optimal architectures via progressive super-network shrinking
Enhance architecture search by integrating knowledge distillation to avoid over-fitting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Super-network progressive shrinking for flexible architectures
Integrates knowledge distillation with architecture search
Minimizes behavioral differences to avoid over-fitting
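"Search by imitating" rests on a standard knowledge-distillation objective: penalizing the divergence between the super-network's softened output distribution and the teacher's. The sketch below shows this generic KD term only; the temperature value and function names are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def kd_loss(student_logits, teacher_logits, T=4.0):
    """Generic knowledge-distillation term: KL divergence between the
    temperature-softened teacher and student distributions, scaled by
    T^2 as in standard KD, so the compact network imitates the
    over-parameterized teacher's behavior."""
    p = softmax(teacher_logits, T)  # soft teacher targets
    q = softmax(student_logits, T)  # student predictions
    return float(np.sum(p * (np.log(p) - np.log(q)))) * T * T

# Identical logits -> zero imitation loss; mismatched logits -> positive loss.
print(kd_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(kd_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0]))
```

Minimizing this term alongside the classification loss is what the abstract means by "minimizing behavioral differences between the super-network and teacher network", which regularizes one-level DARTS against over-fitting.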
Xuan Rao
School of Systems Science, Beijing Normal University, Beijing 100875, China
Bo Zhao
School of Systems Science, Beijing Normal University, Beijing 100875, China
Derong Liu
Nonlinear dynamical systems, adaptive dynamic programming, intelligent control, recurrent neural networks