Circuit Distillation

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Conventional knowledge distillation focuses on behavioral imitation, treating the teacher model as a black box and transferring only its output distributions. Method: This paper proposes “circuit distillation”, a mechanism-aware approach that transfers the teacher's underlying computational mechanisms rather than surface behavior. It aligns interpretable internal components (e.g., entity-tracking and theory-of-mind circuits) between Llama3-based teacher and student models by matching functionally correspondent circuit components and applying a representation similarity loss between them. Contribution/Results: To the authors' knowledge, this is the first work to achieve mechanism-level distillation grounded in functional circuits. It transfers complex algorithmic capabilities with minimal parameter tuning: only a small, targeted subset of student parameters is fine-tuned, which also makes the resulting student mechanisms more interpretable and controllable. Empirical evaluation on entity-tracking and theory-of-mind tasks shows gains over conventional distillation baselines, supporting the claim that mechanistic alignment, not just statistical mimicry, is what enables effective transfer of algorithmic capabilities.
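The representation similarity loss described above can be sketched in miniature. This is an illustrative stand-in, not the paper's implementation: it assumes activations from matched teacher/student circuit components have already been extracted into equal-length vectors, and it uses mean cosine distance as the similarity measure (the paper's exact loss may differ).

```python
import math


def cosine_alignment_loss(teacher_acts, student_acts):
    """Mean cosine distance between paired activation vectors.

    teacher_acts / student_acts: lists of equal-length float vectors,
    one pair per matched circuit component. Hypothetical names; a
    sketch of a representation similarity objective, not the paper's.
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)

    dists = [1.0 - cosine(t, s) for t, s in zip(teacher_acts, student_acts)]
    return sum(dists) / len(dists)
```

Identical activations yield a loss of 0; orthogonal activations yield 1, so minimizing this pushes student components toward their teacher counterparts.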

📝 Abstract
Model distillation typically focuses on behavioral mimicry, where a student model is trained to replicate a teacher's output while treating its internal computations as a black box. In this work we propose an alternative approach: distilling the underlying computational mechanisms implemented by a teacher model. Specifically, we propose circuit distillation, which introduces an objective to align internal representations between analogous circuit components in teacher and student models. We propose a method to match "functionally correspondent" circuit components and introduce a loss reflecting similarities between the representations that these induce. We evaluate circuit distillation on entity tracking and theory of mind (ToM) tasks using models from the Llama3 family. Our results demonstrate that circuit distillation outperforms standard distillation, successfully transferring algorithmic capabilities by adjusting only a small, targeted subset of student model parameters. This work establishes the feasibility of transferring mechanisms, which may in turn allow for efficient distillation of targeted teacher capabilities via interpretable and controllable internal student mechanisms.
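One way to operationalize the "functionally correspondent" matching the abstract mentions is to compare component activations on a shared probe set using linear CKA (centered kernel alignment) and pair components greedily. The sketch below is an assumption-laden illustration; the function names and the greedy strategy are ours, not necessarily the paper's.

```python
import math


def _center(X):
    """Column-center a row-major activation matrix (list of rows)."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[x - m for x, m in zip(row, means)] for row in X]


def _cross_frob2(X, Y):
    """||X^T Y||_F^2 for matrices with equal row counts."""
    total = 0.0
    for i in range(len(X[0])):
        for j in range(len(Y[0])):
            s = sum(X[r][i] * Y[r][j] for r in range(len(X)))
            total += s * s
    return total


def linear_cka(X, Y):
    """Linear CKA between two activation matrices whose rows come from
    the same probe inputs; invariant to isotropic scaling."""
    Xc, Yc = _center(X), _center(Y)
    return _cross_frob2(Xc, Yc) / math.sqrt(
        _cross_frob2(Xc, Xc) * _cross_frob2(Yc, Yc))


def match_components(teacher_reps, student_reps):
    """Greedily pair each teacher component with the unused student
    component whose activations score highest under linear CKA."""
    used, pairs = set(), {}
    for t_name, t_acts in teacher_reps.items():
        best = max((s for s in student_reps if s not in used),
                   key=lambda s: linear_cka(t_acts, student_reps[s]))
        used.add(best)
        pairs[t_name] = best
    return pairs
```

Because linear CKA ignores uniform rescaling, a student component that computes the same function as a teacher component (up to scale) scores 1.0 and is matched first.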
Problem

Research questions and friction points this paper is trying to address.

Distilling computational mechanisms from teacher models
Aligning internal representations between corresponding circuit components
Transferring algorithmic capabilities via interpretable internal mechanisms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Aligning internal representations between teacher and student circuits
Matching functionally correspondent circuit components
Transferring algorithmic capabilities via targeted parameter adjustment
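The last point, adjusting only a targeted parameter subset, can be illustrated with a toy update rule that leaves every non-circuit parameter frozen. The names and the plain SGD step here are hypothetical; in the paper, which parameters are tuned is determined by the identified circuits.

```python
def targeted_update(params, grads, trainable, lr=0.1):
    """One SGD step that touches only the parameters named in
    `trainable`, mimicking distillation that fine-tunes just the
    matched circuit subset. Toy sketch with hypothetical names."""
    return {
        name: (value - lr * grads[name]) if name in trainable else value
        for name, value in params.items()
    }
```

Parameters outside the trainable set pass through unchanged, which is what keeps the rest of the student model, and any capabilities it already has, intact.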