Adaptive Temperature Based on Logits Correlation in Knowledge Distillation

📅 2025-03-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: In knowledge distillation, the temperature parameter is typically set manually, and it is not clear how temperature promotes the transfer of logit information. Method: This paper proposes a dynamic temperature computation method that refers only to the teacher model's maximum logit, and it links temperature design to the correlation of logits. Because a single scalar (the maximum logit) determines the temperature, the method is lightweight and reduces computational time relative to state-of-the-art methods. Contribution/Results: An approximation of the distillation process converges to a correlation between teacher and student logits, which supports the view that distillation conveys the relevance of logits. Experiments on a standard benchmark dataset show promising results across different teacher-student model pairs, and the approximating algorithm yields higher temperatures than the commonly used static values.

📝 Abstract

Knowledge distillation is a technique for reproducing the performance of a deep learning model in another, smaller model. It uses the outputs of one model to train another model to comparable accuracy. By analogy with how information is passed on in human society, one model acts as the "teacher" and the other as the "student". Softmax makes the logits generated by the two models comparable by converting them into probability distributions. It delivers the teacher's logits to the student in compressed form through a parameter called temperature, and tuning this parameter improves distillation performance. Although this is the only parameter that mediates the interaction of logits, it is not clear how temperature promotes information transfer. In this paper, we propose a novel approach to calculating the temperature. Our method refers only to the maximum logit generated by the teacher model, which reduces computational time compared with state-of-the-art methods. It shows promising results for different student and teacher models on a standard benchmark dataset, and algorithms that use temperature can obtain an improvement by plugging in this dynamic approach. Furthermore, the approximation of the distillation process converges to a correlation between the logits of the two models, which reinforces the previous argument that distillation conveys the relevance of logits. We report that this approximating algorithm yields a higher temperature than the commonly used static values in testing.
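
As a rough illustration of the mechanism described in the abstract, the sketch below softens teacher and student logits with a temperature-scaled softmax and compares them with a KL divergence, where the temperature is computed per sample from the teacher's maximum logit. The abstract does not give the paper's temperature formula, so `placeholder_temperature` is a hypothetical stand-in, and all function names are assumptions made for this example.

```python
import math
from typing import Callable, List, Sequence


def log_softmax_with_temperature(logits: Sequence[float], temperature: float) -> List[float]:
    """Return log-probabilities of softmax(logits / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    log_total = math.log(sum(math.exp(s - shift) for s in scaled))
    return [s - shift - log_total for s in scaled]


def placeholder_temperature(max_logit: float) -> float:
    """HYPOTHETICAL rule, NOT the paper's formula: scale with the max logit, floor at 1."""
    return max(1.0, abs(max_logit))


def distillation_kl(
    teacher_logits: Sequence[float],
    student_logits: Sequence[float],
    temperature_fn: Callable[[float], float] = placeholder_temperature,
) -> float:
    """KL(teacher || student) between softened distributions for one sample."""
    # The temperature depends only on a single scalar: the teacher's maximum logit.
    temperature = temperature_fn(max(teacher_logits))
    log_p = log_softmax_with_temperature(teacher_logits, temperature)
    log_q = log_softmax_with_temperature(student_logits, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


if __name__ == "__main__":
    teacher = [5.0, 1.0, 0.0]
    student = [3.0, 2.0, 0.5]
    print("temperature:", placeholder_temperature(max(teacher)))
    print("KL loss:", distillation_kl(teacher, student))
```

Passing the rule in as `temperature_fn` keeps the loss code unchanged when a static value (for example `lambda _: 4.0`) is swapped for a dynamic one, which is the plug-in usage the abstract describes.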

Problem

Research questions and friction points this paper is trying to address.

Sets the temperature parameter in knowledge distillation dynamically instead of by manual tuning.

Reduces computational time by using only the teacher model's maximum logit.

Improves distillation performance across different teacher and student models.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic temperature calculation using maximum logit

Reduced computational time in distillation process

Approximation of the distillation process that converges to a correlation between teacher and student logits
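
To make the notion of a correlation between teacher and student logits concrete, the sketch below computes a Pearson correlation coefficient for one sample. The source does not state which correlation measure the paper's analysis converges to, so using Pearson correlation here is an assumption, and the function name is hypothetical.

```python
import math
from typing import Sequence


def pearson_correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length logit vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have the same length, at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant vectors")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    teacher_logits = [5.0, 1.0, 0.0]
    student_logits = [3.0, 2.0, 0.5]
    print(pearson_correlation(teacher_logits, student_logits))
```

This measure is unchanged when either logit vector is shifted by a constant or multiplied by a positive factor, so it compares the relative ordering and spacing of the logits and not their absolute scale.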
