Improving Sparse Memory Finetuning

πŸ“… 2026-04-06
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses catastrophic forgetting in large language models during continual learning, which arises from updates to shared parameters. The authors propose Sparse Memory Finetuning (SMF), a method that integrates an explicit sparse memory module into the Qwen-2.5-0.5B model to localize knowledge updates to a minimal subset of parameters. A novel slot-selection mechanism based on KL divergence prioritizes memory allocation for tokens exhibiting high "information surprise," thereby improving update efficiency. By combining parameter-efficient finetuning with an open-source model-adaptation pipeline, SMF significantly mitigates forgetting on consumer-grade hardware, demonstrating the feasibility and practicality of sparse updating paradigms in continual learning scenarios.
πŸ“ Abstract
Large Language Models (LLMs) are typically static after training, yet real-world applications require continual adaptation to new knowledge without degrading existing capabilities. Standard approaches to updating models, such as full finetuning or parameter-efficient methods (e.g., LoRA), suffer from catastrophic forgetting: they modify shared dense representations, causing interference across tasks. Sparse Memory Finetuning (SMF) offers a promising alternative by localizing updates to a small subset of parameters in explicit memory layers. In this work, we present an open-source pipeline to retrofit existing pretrained models (Qwen-2.5-0.5B) with sparse memory modules, enabling effective continual learning on consumer hardware. We extend prior work by introducing a theoretically grounded slot-selection mechanism based on Kullback-Leibler (KL) divergence, which prioritizes memory updates for informationally "surprising" tokens relative to a background distribution. Our experiments demonstrate that our retrofitted models can acquire new factual knowledge with minimal forgetting of held-out capabilities, validating the sparse update hypothesis in a practical setting.
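The KL-based slot selection described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the use of a uniform background distribution, and the fixed top-k slot budget are all assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def surprise_scores(logits: torch.Tensor, background_probs: torch.Tensor) -> torch.Tensor:
    """Per-token KL divergence between the model's predictive distribution
    and a background distribution (e.g. a unigram or uniform prior).

    logits: (seq_len, vocab) next-token logits.
    background_probs: (vocab,) background distribution q.
    Returns: (seq_len,) surprise score KL(p_t || q) per position.
    """
    log_p = F.log_softmax(logits, dim=-1)
    p = log_p.exp()
    # KL(p || q) = sum_v p(v) * (log p(v) - log q(v))
    return (p * (log_p - background_probs.log())).sum(dim=-1)

def select_memory_slots(logits: torch.Tensor,
                        background_probs: torch.Tensor,
                        k: int) -> torch.Tensor:
    """Pick the k most 'surprising' token positions to receive sparse
    memory updates; the remaining positions leave memory untouched."""
    scores = surprise_scores(logits, background_probs)
    return torch.topk(scores, k=min(k, scores.numel())).indices
```

Under a uniform background, positions where the model is confidently peaked score highest, so the memory budget is spent on tokens that carry the most new information relative to the prior.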
Problem

Research questions and friction points this paper is trying to address.

continual learning
catastrophic forgetting
sparse memory
large language models
knowledge updating
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Memory Finetuning
continual learning
KL divergence
catastrophic forgetting
parameter-efficient tuning
Satyam Goyal
University of Michigan, Ann Arbor
Generative AI · Artificial Intelligence · Deep Learning
Anirudh Kanchi
University of Michigan, Ann Arbor
Garv Shah
University of Michigan, Ann Arbor
Prakhar Gupta
University of Michigan, Ann Arbor