Improving Sparse Memory Finetuning

πŸ“… 2026-04-06
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses catastrophic forgetting in large language models during continual learning, which arises from updates to shared parameters. The authors propose Sparse Memory Finetuning (SMF), a method that integrates an explicit sparse memory module into the Qwen-2.5-0.5B model to localize knowledge updates to a minimal subset of parameters. A novel slot-selection mechanism based on KL divergence prioritizes memory allocation for tokens exhibiting high "information surprise," thereby improving update efficiency. By combining parameter-efficient finetuning with an open-source model-adaptation pipeline, SMF significantly mitigates forgetting on consumer-grade hardware, demonstrating the feasibility and practicality of sparse updating paradigms in continual learning scenarios.
πŸ“ Abstract
Large Language Models (LLMs) are typically static after training, yet real-world applications require continual adaptation to new knowledge without degrading existing capabilities. Standard approaches to updating models, such as full finetuning or parameter-efficient methods (e.g., LoRA), suffer from catastrophic forgetting: they modify shared dense representations, causing interference across tasks. Sparse Memory Finetuning (SMF) offers a promising alternative by localizing updates to a small subset of parameters in explicit memory layers. In this work, we present an open-source pipeline to retrofit existing pretrained models (Qwen-2.5-0.5B) with sparse memory modules, enabling effective continual learning on consumer hardware. We extend prior work by introducing a theoretically grounded slot-selection mechanism based on Kullback-Leibler (KL) divergence, which prioritizes memory updates for informationally "surprising" tokens relative to a background distribution. Our experiments demonstrate that our retrofitted models can acquire new factual knowledge with minimal forgetting of held-out capabilities, validating the sparse update hypothesis in a practical setting.
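The KL-based slot selection described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the use of a uniform background distribution, and the fixed top-k slot budget are all assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def surprise_scores(logits: torch.Tensor, background_probs: torch.Tensor) -> torch.Tensor:
    """Per-token KL divergence between the model's predictive distribution
    and a background distribution (e.g. a unigram or uniform prior).

    logits: (seq_len, vocab) next-token logits.
    background_probs: (vocab,) background distribution q.
    Returns: (seq_len,) surprise score KL(p_t || q) per position.
    """
    log_p = F.log_softmax(logits, dim=-1)
    p = log_p.exp()
    # KL(p || q) = sum_v p(v) * (log p(v) - log q(v))
    return (p * (log_p - background_probs.log())).sum(dim=-1)

def select_memory_slots(logits: torch.Tensor,
                        background_probs: torch.Tensor,
                        k: int) -> torch.Tensor:
    """Pick the k most 'surprising' token positions to receive sparse
    memory updates; the remaining positions leave memory untouched."""
    scores = surprise_scores(logits, background_probs)
    return torch.topk(scores, k=min(k, scores.numel())).indices
```

Under a uniform background, positions where the model is confidently peaked score highest, so the memory budget is spent on tokens that carry the most new information relative to the prior.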
Problem

Research questions and friction points this paper is trying to address.

continual learning
catastrophic forgetting
sparse memory
large language models
knowledge updating
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Memory Finetuning
continual learning
KL divergence
catastrophic forgetting
parameter-efficient tuning
Satyam Goyal
University of Michigan, Ann Arbor
Generative AI · Artificial Intelligence · Deep Learning
Anirudh Kanchi
University of Michigan, Ann Arbor
Garv Shah
University of Michigan, Ann Arbor
Prakhar Gupta
University of Michigan, Ann Arbor