🤖 AI Summary
Supervised fine-tuning (SFT) and knowledge distillation for large language models (LLMs) typically demand substantial labeled data and high computational resources. Method: This paper proposes an entropy-driven hierarchical fine-tuning paradigm. Its core innovation is the first use of per-token answer entropy as a lightweight, dynamic criterion to identify complex samples and adaptively trigger chain-of-thought reasoning and knowledge distillation. A complexity-aware data partitioning mechanism is further introduced to enable on-demand reasoning and efficient distillation. Results: Evaluated on a 3B-parameter model, the method matches full-data distillation accuracy using only 38% of the training data, attaining an average accuracy of 0.55 versus 0.43 for standard SFT, while the entropy criterion separates simple from complex samples with an ROC AUC of 0.73. The approach thus balances performance, training efficiency, and scalability.
📝 Abstract
General-purpose Large Language Models (LLMs) are frequently fine-tuned through supervised fine-tuning (SFT) to enhance performance in specific domains. Better results can be achieved by distilling the chain-of-thought of a larger model, at the cost of numerous expensive calls and a much larger amount of data. We propose a novel blueprint for efficient fine-tuning that uses reasoning only for complex data identified by entropy. Specifically, across two small open models ($\approx 3$B) we split the training data into complexity categories using single-token answer entropy (ROC AUC $0.73$), fine-tune the models via SFT and distillation, and show that our pipeline significantly outperforms the standard SFT approach ($0.55$ vs. $0.43$ average accuracy) and matches distillation performance while using $62\%$ less data ($0.55$ average accuracy for both). We publish our code and data to facilitate further research in this direction.
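The core idea of the partitioning step can be sketched in a few lines: compute the Shannon entropy of the model's next-token distribution at the answer position, and route high-entropy (uncertain) samples to the expensive chain-of-thought/distillation path. This is a minimal illustration under assumed inputs; the function names, the entropy threshold, and the toy distributions below are hypothetical, not taken from the paper's released code.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def partition_by_entropy(samples, threshold=1.0):
    """Split samples into 'simple' and 'complex' buckets by answer-token entropy.

    Each sample is (sample_id, probs), where probs is the model's distribution
    over candidate answer tokens at the single answer position. The threshold
    is a hypothetical tunable; the paper selects it via ROC analysis.
    """
    simple, complex_ = [], []
    for sample_id, probs in samples:
        bucket = complex_ if token_entropy(probs) > threshold else simple
        bucket.append(sample_id)
    return simple, complex_

# Toy example: a confident (low-entropy) distribution vs. a near-uniform one.
samples = [
    ("easy_q", [0.97, 0.01, 0.01, 0.01]),  # entropy ~0.17 nats -> plain SFT
    ("hard_q", [0.25, 0.25, 0.25, 0.25]),  # entropy ~1.39 nats -> CoT/distill
]
simple, complex_ = partition_by_entropy(samples, threshold=1.0)
```

Only the `complex_` bucket would then be sent through the costly chain-of-thought distillation path, which is how the pipeline avoids paying teacher-model costs on easy samples.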