TinyHelen's First Curriculum: Training and Evaluating Tiny Language Models in a Simpler Language Environment

📅 2024-12-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
The high computational cost and trial-and-error overhead of training and evaluating large language models (LLMs) hinder rapid experimentation and analysis. Method: We propose a simplified language environment paradigm tailored for small models, constructing the Leaner family of lightweight datasets (71M tokens for pretraining; 7M for instruction tuning) via denoising, genre-aware vocabulary compression, and distribution-preserving techniques, ensuring low noise, a compact vocabulary, and genre fidelity. We further design a training–evaluation co-design framework integrating LLM-assisted data distillation, vocabulary-minimized modeling, and multi-granularity instruction-following evaluation. Results: Tiny models trained on the Leaner datasets outperform baselines on instruction following across granularity levels, and Leaner-Pretrain's alignment with conventional pretraining corpora enables resource-controlled attribution analysis of how modeling choices affect language modeling performance. The methodology is designed to scale to larger, more complex scenarios while preserving its efficacy.

📝 Abstract
Training language models (LMs) and their application agents is increasingly costly due to large datasets and models, making test failures difficult to bear. Simplified language environments serve as primordial training and testing grounds, retaining essential commonsense and communication skills but in a more digestible form, potentially enhancing the learning efficiency of LMs, and thus reducing the required model size and data volume for effective training and evaluation. In these simplified language environments, workable strategies for small models, datasets, and agents may be adaptable to larger models, datasets, and agents in complex language environments. To create such environments, we focus on two aspects: i) minimizing language dataset noise and complexity, and ii) preserving the essential text distribution characteristics. Unlike previous methods, we propose a pipeline to refine text data by eliminating noise, minimizing vocabulary, and maintaining genre-specific patterns (e.g., for books, conversation, code, etc.). Implementing this pipeline with large LMs, we have created a leaner suite of LM training and evaluation datasets: 71M Leaner-Pretrain, 7M Leaner-Instruct, Leaner-Glue for assessing linguistic proficiency, and Leaner-Eval for testing instruction-following ability. Our experiments show that leaner pre-training boosts LM learning efficiency. Tiny LMs trained on these datasets outperform those trained on original datasets in instruction-following across different language granularity levels. Moreover, the Leaner-Pretrain dataset's alignment with conventional large LM training sets enables resource-optimized analysis of how learning objectives, model architectures, and training techniques impact performance on language modeling and downstream tasks. Our code and datasets are available at https://github.com/EmpathYang/TinyHelen.git.
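The refinement pipeline described above combines denoising with vocabulary minimization: rare tokens are pruned so that small models face a compact, low-noise language environment. The sketch below illustrates the vocabulary-minimization idea only, using simple frequency truncation over whitespace tokens; the function name, the `<unk>` placeholder, and the whole approach are illustrative assumptions, not the paper's actual (LLM-assisted) implementation.

```python
from collections import Counter

def minimize_vocabulary(corpus, vocab_size):
    """Keep only the vocab_size most frequent whitespace tokens;
    replace everything else with an <unk> placeholder.
    (Hypothetical helper -- a stand-in for the paper's pipeline.)"""
    counts = Counter(tok for line in corpus for tok in line.split())
    keep = {tok for tok, _ in counts.most_common(vocab_size)}
    return [
        " ".join(tok if tok in keep else "<unk>" for tok in line.split())
        for line in corpus
    ]

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]
# With vocab_size=3, only "the", "sat", "on" survive; the nouns
# collapse to <unk>, shrinking the effective vocabulary.
leaner = minimize_vocabulary(corpus, vocab_size=3)
```

The paper's pipeline additionally preserves genre-specific distributions (books, conversation, code, etc.) rather than truncating frequencies globally, which a real implementation would need to account for.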
Problem

Research questions and friction points this paper is trying to address.

Large Language Model Training
Cost Efficiency
Learning Strategies Optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Efficient Dataset Creation
Enhanced Learning Efficiency
Reduced Model Complexity