🤖 AI Summary
To address the low training efficiency and high data requirements of large language models (LLMs) in resource-constrained settings, this work proposes a data-efficient pretraining paradigm that fully trains a 1.7B-parameter LLaMA variant, covering both pretraining and instruction fine-tuning, using only 20 billion high-quality tokens. Methodologically, it combines an architecturally lightweight LLaMA design, progressive learning-rate scheduling, optimizer-state warm restarts, mixed-precision training, and supervised instruction fine-tuning (SFT) to enhance generalization. Through systematic tracking of loss dynamics and downstream performance transitions, the work quantitatively establishes, for the first time, the correlation between training dynamics and the emergence of context-aware capabilities. All training logs, checkpoints, and reproducibility scripts are publicly released. Results demonstrate competitive performance against larger models across multiple benchmarks, validating that high-quality data combined with refined training strategies substantially reduces token consumption while preserving throughput and training stability.
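The "progressive learning rate scheduling" mentioned above can be pictured as linear warmup followed by cosine decay, the schedule commonly used for LLaMA-style pretraining. The sketch below is our illustration under that assumption; the exact schedule and hyperparameters used for DMaS-LLaMa-Lite are not specified here.

```python
import math

def lr_at_step(step: int, max_steps: int, peak_lr: float,
               warmup_steps: int, min_lr: float) -> float:
    """Linear warmup then cosine decay (assumed schedule, for illustration)."""
    if step < warmup_steps:
        # Ramp linearly from ~0 up to peak_lr over the warmup phase.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

In a training loop, the returned value would be written into each optimizer parameter group before the step (`group["lr"] = lr_at_step(...)`).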
📝 Abstract
Pretraining large language models is a complex endeavor influenced by multiple factors, including model architecture, data quality, training continuity, and hardware constraints. In this paper, we share insights gained from the experience of training DMaS-LLaMa-Lite, a fully open-source, 1.7-billion-parameter, LLaMa-based model, on approximately 20 billion tokens of carefully curated data. We chronicle the full training trajectory, documenting how evolving validation loss levels and downstream benchmarks reflect transitions from incoherent text to fluent, contextually grounded output. Beyond pretraining, we extend our analysis to include a post-training phase focused on instruction tuning, where the model was refined to produce more contextually appropriate, user-aligned responses. We highlight practical considerations such as the importance of restoring optimizer states when resuming from checkpoints, and the impact of hardware changes on training stability and throughput. While qualitative evaluation provides an intuitive understanding of model improvements, our analysis extends to various performance benchmarks, demonstrating how high-quality data and thoughtful scaling enable competitive results with significantly fewer training tokens. By detailing these experiences and offering training logs, checkpoints, and sample outputs, we aim to guide future researchers and practitioners in refining their pretraining strategies. The training script is available on GitHub at https://github.com/McGill-DMaS/DMaS-LLaMa-Lite-Training-Code. The model checkpoints are available on Hugging Face at https://huggingface.co/collections/McGill-DMaS/dmas-llama-lite-6761d97ba903f82341954ceb.