Comparing Knowledge Injection Methods for LLMs in a Low-Resource Regime

📅 2025-08-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of efficiently injecting limited unstructured knowledge into large language models (LLMs) in low-resource settings while mitigating catastrophic forgetting, this paper proposes a synthetic data augmentation method based on diverse prompting. It controllably generates high-variability, fact-aligned textual variants to replace human-annotated data, enabling joint optimization of knowledge injection and retention. The approach integrates a continual pretraining framework, a retrieval-augmented generation (RAG) baseline, and a self-generated data fine-tuning strategy. A fine-grained question-answering probing mechanism quantitatively assesses factual learning and forgetting. Experiments demonstrate that parametric knowledge injection substantially outperforms RAG; increasing synthetic data diversity improves factual memory by up to 23.6%; and the model can autonomously generate high-quality training samples. All code and datasets are publicly released.
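The diverse-prompting augmentation described above can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: `call_llm` is a hypothetical stand-in for any text-generation backend, and the prompt templates are illustrative wording.

```python
import itertools

# Illustrative prompt templates for producing varied, fact-aligned
# rewrites of a source passage; the paper's exact prompts may differ.
TEMPLATES = [
    "Rewrite the following passage in your own words:\n{doc}",
    "Summarize the key facts of this passage as bullet points:\n{doc}",
    "Explain the passage below to a high-school student:\n{doc}",
    "Write three question-answer pairs grounded in this passage:\n{doc}",
]

def augment(documents, call_llm, variants_per_template=2):
    """Generate synthetic textual variants of each source document.

    `call_llm` is a hypothetical function (prompt: str) -> str wrapping
    whatever generation API is available. Each document is paired with
    every template, and each prompt is sampled several times so the
    resulting corpus covers many surface forms of the same facts.
    """
    synthetic = []
    for doc, template in itertools.product(documents, TEMPLATES):
        prompt = template.format(doc=doc)
        for _ in range(variants_per_template):
            synthetic.append(call_llm(prompt))
    return synthetic
```

The returned variants would then be mixed into the continued-pretraining corpus in place of (or alongside) the original passages.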

📝 Abstract
Large language models (LLMs) often require vast amounts of text to effectively acquire new knowledge. While continuing pre-training on large corpora or employing retrieval-augmented generation (RAG) has proven successful, updating an LLM with only a few thousand or million tokens remains challenging. In this work, we investigate the task of injecting small, unstructured information into LLMs and its relation to the catastrophic forgetting phenomenon. We use a dataset of recent news -- ensuring no overlap with the model's pre-training data -- to evaluate knowledge acquisition by probing the model with question-answer pairs related to the learned information. Starting from a continued pre-training baseline, we explore different augmentation algorithms that generate synthetic data to improve knowledge acquisition. Our experiments show that simply continuing pre-training on limited data yields modest improvements, whereas exposing the model to diverse textual variations significantly improves the learning of new facts -- particularly with methods that induce greater variability through diverse prompting. Furthermore, we shed light on the forgetting phenomenon in small-data regimes, illustrating the delicate balance between learning new content and retaining existing capabilities. We also confirm the sensitivity of RAG-based approaches for knowledge injection, which often lead to greater degradation on control datasets compared to parametric methods. Finally, we demonstrate that models can generate effective synthetic training data themselves, suggesting a pathway toward self-improving model updates. All code and generated data used in our experiments are publicly available, providing a resource for studying efficient knowledge injection in LLMs with limited data at https://github.com/hugoabonizio/knowledge-injection-methods.
Problem

Research questions and friction points this paper is trying to address.

Injecting small unstructured information into LLMs effectively
Balancing new knowledge acquisition with catastrophic forgetting prevention
Evaluating synthetic data generation for low-resource knowledge injection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Augmentation algorithms generate synthetic data
Diverse prompting improves learning new facts
Models self-generate effective synthetic training data
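The question-answering probe used to quantify learning and forgetting can be sketched along these lines, assuming a hypothetical `answer` function wrapping the model under evaluation and simple exact-match scoring (the paper's probing is described as more fine-grained):

```python
def exact_match_accuracy(qa_pairs, answer):
    """Fraction of probe questions answered correctly.

    `answer` is a hypothetical function (question: str) -> str; scoring
    here is case-insensitive exact match against the gold answer.
    """
    correct = sum(
        answer(q).strip().lower() == gold.strip().lower()
        for q, gold in qa_pairs
    )
    return correct / len(qa_pairs)

def probe(new_qa, control_qa, answer_before, answer_after):
    """Compare a model before and after a knowledge-injection update.

    `new_qa` probes the injected facts (gain = learning); `control_qa`
    probes pre-existing knowledge (drop = forgetting).
    """
    return {
        "learning": exact_match_accuracy(new_qa, answer_after)
                    - exact_match_accuracy(new_qa, answer_before),
        "forgetting": exact_match_accuracy(control_qa, answer_before)
                      - exact_match_accuracy(control_qa, answer_after),
    }
```

Reporting both deltas side by side makes the trade-off explicit: an update that maximizes `learning` while pushing `forgetting` well above zero is degrading existing capabilities.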
Hugo Abonizio
Faculdade de Engenharia Elétrica e de Computação (FEEC), University of Campinas (Unicamp)
Thales Almeida
Instituto de Computação (IC), University of Campinas (Unicamp)
Roberto Lotufo
Faculdade de Engenharia Elétrica e de Computação (FEEC), University of Campinas (Unicamp), NeuralMind
Rodrigo Nogueira
Founder and CEO of Maritaca AI
Deep Learning · Natural Language Processing · Information Retrieval