Knowledge distillation through geometry-aware representational alignment

📅 2025-09-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing feature-based distillation methods (e.g., projection-based MSE, CKA) fail to capture the geometric structure of teacher feature spaces: structural mismatch can persist even when the loss vanishes. Method: A geometry-aware distillation framework that theoretically exposes this limitation of conventional approaches and introduces two complementary structural priors as distillation losses: the Procrustes distance, which measures optimal rigid alignment, and the Frobenius norm of feature Gram matrices, which captures second-order (pairwise) feature statistics. Contribution/Results: This enables joint modeling of rigid transformations and pairwise correlations in feature space. Evaluated across language model families (BERT and OPT) on classification and instruction-following tasks, the approach achieves up to a 2-percentage-point improvement over state-of-the-art baselines including CKA, establishing a theoretically grounded, geometry-driven paradigm for structural alignment in knowledge distillation.
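To make the Procrustes term concrete, here is a minimal PyTorch sketch of one standard formulation of the Procrustes distance between student and teacher feature matrices, using the nuclear-norm closed form for optimal orthogonal alignment. The function name, normalization, and shape assumptions are illustrative, not the paper's exact implementation.

```python
import torch

def procrustes_distance(X: torch.Tensor, Y: torch.Tensor,
                        eps: float = 1e-8) -> torch.Tensor:
    """Procrustes distance between student features X and teacher features Y,
    each of shape (batch, dim), assumed equal-width after any projection."""
    # Center each feature matrix across the batch dimension
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)
    # Scale to unit Frobenius norm so the distance ignores overall magnitude
    X = X / (torch.linalg.matrix_norm(X) + eps)
    Y = Y / (torch.linalg.matrix_norm(Y) + eps)
    # min over orthogonal Q of ||X - Y Q||_F has the closed form
    # sqrt(2 - 2 * nuclear_norm(Y^T X)) for unit-norm X and Y
    nuclear = torch.linalg.svdvals(Y.transpose(0, 1) @ X).sum()
    return torch.sqrt(torch.clamp(2.0 - 2.0 * nuclear, min=0.0))
```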

📝 Abstract
Knowledge distillation is a common paradigm for transferring capabilities from larger models to smaller ones. While traditional distillation methods leverage a probabilistic divergence over the outputs of the teacher and student models, feature-based distillation methods often minimize variants of Euclidean norms between the hidden-layer representations. The main goal is for the student to mimic the structure of the teacher's feature space. In this work, we theoretically show that existing feature distillation methods, such as projection-based mean squared loss or Centered Kernel Alignment (CKA), cannot capture the feature structure, even under zero loss. We then motivate the use of the Procrustes distance and the Frobenius norm of the feature Gram matrix, distances already common in the context of measuring representational alignment, as distillation losses. We show that feature distillation through our method yields statistically significant improvements in distillation performance across language model families (BERT and OPT) on classification and instruction-following tasks, by up to 2 percentage points, showcasing the potential of integrating feature geometry into existing distillation methods.
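As a companion sketch, the Gram-matrix term described in the abstract can be implemented by comparing normalized batch-level Gram matrices. Since Gram matrices are batch × batch, this works even when student and teacher hidden sizes differ; the unit-norm scaling below is an assumption, not necessarily the paper's choice.

```python
import torch

def gram_frobenius_loss(X: torch.Tensor, Y: torch.Tensor,
                        eps: float = 1e-8) -> torch.Tensor:
    """Frobenius distance between normalized feature Gram matrices.
    X: student features (batch, d_s); Y: teacher features (batch, d_t)."""
    # Gram matrices encode pairwise sample similarities and are
    # (batch, batch), so mismatched hidden sizes are fine
    G_x = X @ X.transpose(0, 1)
    G_y = Y @ Y.transpose(0, 1)
    # Normalize to unit Frobenius norm to remove scale effects
    G_x = G_x / (torch.linalg.matrix_norm(G_x) + eps)
    G_y = G_y / (torch.linalg.matrix_norm(G_y) + eps)
    return torch.linalg.matrix_norm(G_x - G_y)
```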
Problem

Research questions and friction points this paper is trying to address.

Existing feature distillation losses (e.g., projection-based MSE, CKA) can vanish without the student matching the teacher's feature-space geometry
How to design geometry-aware distillation losses from the Procrustes distance and the feature Gram matrix
Whether geometric losses improve distillation performance across BERT and OPT model families
Innovation

Methods, ideas, or system contributions that make the work stand out.

Procrustes distance as a feature-alignment distillation loss
Frobenius norm of the feature Gram matrix as a second-order structural loss
Geometry-aware distillation that improves language model distillation (a combined-loss sketch follows this list)
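A hypothetical way to combine the two structural priors with an ordinary task loss during distillation, reusing the two sketch functions above. The weights alpha and beta and the function signature are illustrative assumptions, not the paper's reported configuration.

```python
def distillation_loss(task_loss, student_feats, teacher_feats,
                      alpha: float = 0.5, beta: float = 0.5):
    # task_loss: e.g. cross-entropy or a KL term on output logits
    # alpha/beta: hypothetical weights for the two geometric terms
    return (task_loss
            + alpha * procrustes_distance(student_feats, teacher_feats)
            + beta * gram_frobenius_loss(student_feats, teacher_feats))
```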
Prajjwal Bhattarai, New York University Abu Dhabi
Mohammad Amjad, New York University Abu Dhabi
Dmytro Zhylko, New York University Abu Dhabi
Tuka Alhanai, New York University Abu Dhabi
machine learning · computer science · signal processing