🤖 AI Summary
To address the performance degradation that data scarcity causes in few-shot adaptation of large language models, this paper provides the first formal theoretical proof that the backbone model's original pretraining dataset can be safely and effectively reused for downstream adaptation. Leveraging this insight, we propose ALBAT, an adaptive backbone data selection framework that integrates mathematical modeling, backbone data distillation, dynamic weighted sampling, and lightweight fine-tuning. Evaluated on personalized image generation and low-resource language generation, ALBAT matches full fine-tuning performance while using only 10% of the adaptation data, significantly improving the efficiency and generalization of few-shot adaptation. Our core contributions are: (i) a rigorous theoretical foundation for safe pretraining data reuse, and (ii) a practical, empirically verifiable data repurposing paradigm that bridges pretraining and adaptation without compromising robustness or fidelity.
📝 Abstract
Adaptation techniques facilitate efficient training of large backbone models, including diffusion models for image generation and transformer-based language models. While various adaptation techniques enhance performance with minimal computational resources, scarce adaptation data often makes training challenging. To address this, we turn to the enormous amount of backbone data used to pre-train the backbone models. We propose Backbone Augmented Training (BAT), a method that leverages backbone data to augment the adaptation dataset. First, we formulate and prove two key mathematical propositions: one establishes the validity of BAT, while the other identifies a condition under which BAT benefits adaptation. We then introduce an advanced data selection scheme that satisfies these propositions and present the ALBAT algorithm to implement it. ALBAT efficiently enhances adaptation training in both personalization and language generation tasks with scarce data.
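The core idea of augmenting a scarce adaptation set with selectively sampled backbone data can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's actual algorithm: `score_fn`, the batch-mixing interface, and the example scoring rule are all hypothetical stand-ins for the paper's data selection scheme and dynamic weighted sampling.

```python
import random

def backbone_augmented_batch(adapt_data, backbone_data, score_fn, k=2, seed=0):
    """Build one training batch mixing adaptation examples with backbone
    examples drawn by weighted sampling (hypothetical interface).

    adapt_data    : list of adaptation examples (always included)
    backbone_data : list of backbone (pretraining) examples
    score_fn      : maps a backbone example to a relevance weight >= 0
    k             : number of backbone examples added per batch
    """
    rng = random.Random(seed)
    weights = [score_fn(x) for x in backbone_data]
    # Higher-scoring backbone examples are drawn more often, loosely
    # mimicking a dynamic weighted sampling scheme.
    extras = rng.choices(backbone_data, weights=weights, k=k)
    return list(adapt_data) + extras

# Toy usage: scalar "examples", relevance = closeness to the adaptation data.
adapt = [1.0, 1.1]
backbone = [0.9, 5.0, 1.05]
batch = backbone_augmented_batch(
    adapt, backbone, score_fn=lambda x: 1.0 / (1.0 + abs(x - 1.05)), k=2
)
```

In a real setting the examples would be images or text and `score_fn` would come from the selection criterion derived in the propositions; here it is only meant to show where such a criterion plugs in.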