AI Summary
Conventional supervised fine-tuning (SFT) of large language models (LLMs) employs sample-level data cleaning, overlooking intra-sample heterogeneity in token-level quality: high-quality samples still contain redundant or uninformative tokens, and persistently fitting them may degrade downstream performance. Method: We propose the first fine-grained token-level cleaning framework, modeling token quality from a noisy-label perspective. We design a gradient-based influence analysis for token quality assessment, introducing two paradigms: a fixed reference model and an iteratively self-evolving reference model, with theoretical upper bounds on estimation error. A threshold-driven dynamic filtering mechanism removes low-informativeness tokens. Results: Experiments demonstrate consistent performance gains across multiple downstream tasks, significantly outperforming sample-level cleaning while maintaining tractable computational overhead.
Abstract
Recent studies show that in supervised fine-tuning (SFT) of large language models (LLMs), data quality matters more than quantity. While most data cleaning methods concentrate on filtering entire samples, the quality of individual tokens within a sample can vary significantly. After pre-training, even high-quality samples contain patterns or phrases that are not task-related and are therefore redundant or uninformative. Continuing to fine-tune on these patterns offers limited benefit and can even degrade downstream task performance. In this paper, we investigate token quality from a noisy-label perspective and propose a generic token-cleaning pipeline for SFT tasks. Our method filters out uninformative tokens while preserving those carrying key task-specific information. Specifically, we first evaluate token quality by examining the influence of model updates on each token, then apply a threshold-based separation. The token influence can be measured in a single pass with a fixed reference model or iteratively with self-evolving reference models. The benefits and limitations of both methods are analyzed theoretically via error upper bounds. Extensive experiments show that our framework consistently improves performance across multiple downstream tasks.
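As a rough illustration of the fixed-reference-model paradigm described above, the sketch below scores each token by the reduction in per-token loss that the fine-tuned model achieves over a reference model, then keeps only tokens whose score clears a threshold. This is a minimal, hypothetical rendering of the idea, not the paper's actual scoring function: the function name, the subtraction-based score, and the threshold value are all assumptions for illustration.

```python
import numpy as np

def token_influence_mask(ref_losses, model_losses, threshold=0.0):
    """Hypothetical token-cleaning sketch: score each token by the
    per-token loss reduction of the fine-tuned model relative to a
    fixed reference model, and keep tokens scoring above `threshold`.
    Tokens the model already fits as well as the reference (score ~ 0)
    are treated as uninformative and filtered out."""
    ref_losses = np.asarray(ref_losses, dtype=float)
    model_losses = np.asarray(model_losses, dtype=float)
    influence = ref_losses - model_losses  # larger => more task-informative
    return influence > threshold

# Example: per-token cross-entropy losses for one training sample.
ref = [2.1, 0.3, 1.8, 0.2]  # reference model's per-token losses
cur = [1.2, 0.3, 0.5, 0.6]  # fine-tuned model's per-token losses
mask = token_influence_mask(ref, cur, threshold=0.1)
# Tokens 0 and 2 carry task signal; tokens 1 and 3 are filtered.
```

In the full pipeline, the retained mask would weight or select token positions in the SFT loss; the self-evolving variant would periodically replace the reference losses with those of the partially trained model.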