🤖 AI Summary
This work investigates token efficiency optimization in large language model (LLM) fine-tuning under fixed computational budgets, revealing that data composition—specifically the interplay between sample count and average sequence length—exerts a stronger influence on performance than total token count alone. To address this, we propose the first fine-tuning scaling law that explicitly models data composition, departing from conventional token-count-only assumptions. Empirical analysis on BRICC and MMLU subsets, combined with diverse subsampling strategies and standard scaling law fitting, robustly demonstrates the significant impact of data composition on token efficiency. Our findings yield quantifiable, interpretable theoretical principles and practical guidelines for resource-constrained LLM fine-tuning.
📝 Abstract
We introduce a scaling law for fine-tuning large language models (LLMs) under fixed compute budgets that explicitly accounts for data composition. Conventional approaches measure training data solely by total tokens, yet the number of examples and their average token length (what we term *dataset volume*) play a decisive role in model performance. The parameters of our formulation are fit following established scaling-law fitting procedures. Experiments on the BRICC dataset [salavati2024reducing] and subsets of the MMLU dataset [hendrycks2021measuringmassivemultitasklanguage], evaluated under multiple subsampling strategies, reveal that data composition significantly affects token efficiency. These results motivate refined scaling laws for practical LLM fine-tuning in resource-constrained settings.
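To make the idea concrete, the sketch below fits a hypothetical composition-aware scaling law of the form `loss = A * n^(-alpha) * s^(-beta)`, where `n` is the number of examples and `s` the average sequence length (so `n * s` is the dataset volume). This functional form, the function name, and the synthetic data are assumptions for illustration only, not the paper's actual formulation; they show why separate exponents for `n` and `s` can capture composition effects that a token-count-only law (which depends only on the product `n * s`) cannot.

```python
import numpy as np

def fit_composition_scaling_law(n, s, loss):
    """Fit log(loss) = log(A) - alpha*log(n) - beta*log(s) by least squares.

    n: number of fine-tuning examples per run
    s: average sequence length (tokens) per run
    loss: observed validation loss per run
    """
    # Design matrix for the log-linear model: [1, log n, log s]
    X = np.column_stack([np.ones_like(n, dtype=float), np.log(n), np.log(s)])
    coef, *_ = np.linalg.lstsq(X, np.log(loss), rcond=None)
    log_A, neg_alpha, neg_beta = coef
    return np.exp(log_A), -neg_alpha, -neg_beta

# Synthetic demonstration with known exponents (noiseless, so the fit is exact
# up to numerical precision).
rng = np.random.default_rng(0)
n = rng.integers(1_000, 100_000, size=50).astype(float)
s = rng.integers(64, 2_048, size=50).astype(float)
true_A, true_alpha, true_beta = 5.0, 0.3, 0.1
loss = true_A * n ** (-true_alpha) * s ** (-true_beta)

A, alpha, beta = fit_composition_scaling_law(n, s, loss)
```

If the fitted `alpha` and `beta` differ, performance at a fixed token budget `n * s` depends on how that budget is split between example count and sequence length, which is exactly the composition effect the abstract describes.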