🤖 AI Summary
This study investigates the interplay between model scale and domain specialization (using law as the example domain) under compute constraints in continual pretraining. Method: We conduct controlled continual pretraining across a multi-scale model family (1.5B to 14B parameters), comparing general-domain versus legal-domain data, incorporating domain-specific data filtering and construction, and evaluating performance on legal exam benchmarks. Contribution/Results: We uncover a nonlinear scaling relationship: domain specialization yields increasingly larger compute-efficiency gains as model size grows (relative improvement rises from +8% at 1.5B to +27% at 14B), deviating from standard universal scaling laws. This reveals a strong coupling between model scale and domain adaptation in continual pretraining, challenging prevailing scaling paradigms. We propose a "scale-adaptation" framework tailored to professional domains, offering both theoretical grounding and practical guidance for efficient domain-specific large language model training.
📝 Abstract
Scaling laws for language models have so far focused on finding the compute-optimal model size and token count for training from scratch. However, achieving this optimal balance requires significant compute due to the extensive data demands of training models from randomly initialized weights. Continual pre-training offers a cost-effective alternative, leveraging the compute already invested in pre-trained models to incorporate new knowledge without requiring extensive new data. Recent findings suggest that data quality influences the constants in scaling laws, thereby altering the optimal parameter-token allocation ratio. Building on this insight, we investigate the interplay between domain specialization and model size during continual pre-training under compute-constrained scenarios. Our goal is to identify a compute-efficient training regime for this setting and, potentially, to detect patterns in this interplay that generalize across model sizes and domains. To compare general and specialized training, we filtered a web-based dataset to extract legal-domain data. We pre-trained models with 1.5B, 3B, 7B, and 14B parameters on both the unfiltered and filtered datasets, then evaluated their performance on legal exams. Results show that as model size increases, the compute-effectiveness gap between specialized and general models widens.
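The claim that data quality shifts scaling-law constants can be made concrete with a Chinchilla-style parameterization; this is an assumed functional form for illustration, not necessarily the one fitted in the paper:

```latex
% Chinchilla-style loss scaling law (assumed parameterization):
% N = model parameters, D = training tokens, L = pre-training loss.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
% Higher-quality (e.g. domain-filtered) data changes the fitted
% constants E, A, B, \alpha, \beta, which in turn shifts the
% compute-optimal N:D allocation under a fixed budget C \approx 6ND.
```

Under this view, filtering for legal-domain data does not just change the loss level; it can change the exponents and coefficients, so the compute-optimal balance between model size and token count differs from the general-domain setting.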