🤖 AI Summary
In supervised fine-tuning (SFT), large language models (LLMs) often overemphasize lengthy chain-of-thought (CoT) reasoning, which weakens the modeling of the critical answer tokens and degrades answer accuracy. To address this, we propose SFTKey, a two-stage token-level fine-tuning method: Stage 1 ensures CoT format compliance via standard SFT; Stage 2 applies a weighted loss mask exclusively to answer tokens, explicitly decoupling and strengthening their optimization objective. This is the first work to introduce CoT-aware, token-level weighting and a phased answer-focusing mechanism in SFT. Experiments across multiple benchmarks (e.g., GSM8K, MATH, SVAMP) and model families (e.g., LLaMA-3, Qwen, Phi-3) demonstrate an average accuracy improvement of over 5%, with substantial gains in final answer correctness, while fully preserving CoT formatting fidelity and generation capability.
📝 Abstract
With the rapid advancement of Large Language Models (LLMs), the Chain-of-Thought (CoT) component has become essential for complex reasoning tasks. However, in conventional Supervised Fine-Tuning (SFT), the model can allocate disproportionate attention to CoT sequences of excessive length. This reduces focus on the much shorter but essential Key portion: the final answer, whose correctness directly determines task success and evaluation quality. To address this limitation, we propose SFTKey, a two-stage training scheme. In the first stage, conventional SFT is applied to ensure proper output format; in the second stage, only the Key portion is fine-tuned to improve accuracy. Extensive experiments across multiple benchmarks and model families demonstrate that SFTKey achieves an average accuracy improvement exceeding 5% over conventional SFT, while preserving the ability to generate correctly formatted outputs. Overall, this study advances LLM fine-tuning by explicitly balancing CoT learning with additional optimization on answer-relevant tokens.
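The two-stage idea above can be illustrated with a toy token-level loss. The sketch below is not from the paper; the function name, the `answer_weight` parameter, and the exact masking scheme are illustrative assumptions. It shows the core mechanism: Stage 1 weights every target token uniformly (conventional SFT), while Stage 2 zeroes out CoT tokens and up-weights the answer ("Key") tokens.

```python
import math

def sft_key_loss(token_logprobs, answer_mask, stage, answer_weight=2.0):
    """Toy per-token loss for a two-stage SFTKey-style scheme (illustrative only).

    token_logprobs: log-probabilities the model assigns to each target token.
    answer_mask:    1 for answer ("Key") tokens, 0 for CoT tokens.
    stage:          1 = uniform SFT loss over all tokens;
                    2 = loss restricted to answer tokens, up-weighted.
    Returns the weighted mean negative log-likelihood.
    """
    weights = []
    for is_answer in answer_mask:
        if stage == 1:
            # Stage 1: every token (CoT and answer) contributes equally,
            # so the model learns the output format.
            weights.append(1.0)
        else:
            # Stage 2: only answer tokens contribute, with extra weight
            # (assumed hyperparameter), decoupling their objective from CoT.
            weights.append(answer_weight if is_answer else 0.0)

    total_weight = sum(weights)
    if total_weight == 0.0:
        return 0.0
    nll = sum(-lp * w for lp, w in zip(token_logprobs, weights))
    return nll / total_weight

# Example: two CoT tokens the model fits well, one answer token it fits poorly.
logprobs = [-0.5, -0.5, -2.0]
mask = [0, 0, 1]
stage1 = sft_key_loss(logprobs, mask, stage=1)  # mean over all 3 tokens -> 1.0
stage2 = sft_key_loss(logprobs, mask, stage=2)  # answer token only -> 2.0
```

In this example the Stage 2 loss is dominated by the poorly fit answer token, so its gradient signal is no longer diluted by the many well-fit CoT tokens, which is the imbalance the abstract describes.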