Fast-weight Product Key Memory

📅 2026-01-02
🏛️ arXiv.org
📈 Citations: 1
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the trade-off between memory capacity and computational efficiency in the sequence modeling layers of language models by proposing a dynamic fast-weight Product Key Memory (PKM) mechanism. Building on the fast-weight paradigm, the method transforms the conventional static PKM into a module with dynamic contextual memory, enabling parameter updates during both training and inference via local chunk-level gradient descent. This allows the model to efficiently write and retrieve new key-value pairs. By combining the sparse PKM architecture with dynamic parameter adaptation, the approach substantially reduces perplexity on long-context tasks. Notably, it generalizes to a 128K-token "needle-in-a-haystack" retrieval task after training on only 4K-token sequences, overcoming the limitations inherent in static memory modules.

๐Ÿ“ Abstract
Sequence modeling layers in modern language models typically face a trade-off between storage capacity and computational efficiency. While Softmax attention offers unbounded storage at prohibitive quadratic costs, linear variants provide efficiency but suffer from limited, fixed-size storage. We propose Fast-weight Product Key Memory (FwPKM), a novel architecture that resolves this tension by transforming the sparse Product Key Memory (PKM) from a static module into a dynamic "fast-weight" episodic memory. Unlike PKM, FwPKM updates its parameters dynamically at both training and inference time via local chunk-level gradient descent, allowing the model to rapidly memorize and retrieve new key-value pairs from input sequences. Experiments reveal that FwPKM functions as an effective episodic memory that complements the semantic memory of standard modules, yielding significant perplexity reductions on long-context datasets. Notably, in Needle in a Haystack evaluations, FwPKM generalizes to 128K-token contexts despite being trained on only 4K-token sequences.
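The two ingredients the abstract names — product-key addressing (two small sub-key tables indexing a quadratically larger slot table) and fast-weight writes by local chunk-level gradient descent — can be sketched as follows. This is a minimal NumPy illustration under our own assumptions; the class name, hyperparameters, and update rule are hypothetical simplifications, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

class FwPKM:
    """Toy fast-weight Product Key Memory (illustrative, not the paper's code).

    Product keys: each of n_sub**2 value slots is addressed by a pair of
    sub-keys, so only 2 * n_sub key vectors are stored. The value table
    acts as the "fast weights": it is rewritten at inference time by
    gradient descent on each incoming chunk of key-value pairs.
    """

    def __init__(self, d_key, d_val, n_sub, top_k=4, lr=0.5):
        self.K1 = rng.standard_normal((n_sub, d_key // 2))  # sub-key table 1
        self.K2 = rng.standard_normal((n_sub, d_key // 2))  # sub-key table 2
        self.V = np.zeros((n_sub * n_sub, d_val))           # fast-weight values
        self.n_sub, self.top_k, self.lr = n_sub, top_k, lr

    def _address(self, q):
        # Score each half of the query against its sub-key table.
        h = q.shape[0] // 2
        s1, s2 = self.K1 @ q[:h], self.K2 @ q[h:]
        i1 = np.argsort(s1)[-self.top_k:]
        i2 = np.argsort(s2)[-self.top_k:]
        # Cartesian product of the two top-k lists -> top_k**2 candidate slots.
        cand = (i1[:, None] * self.n_sub + i2[None, :]).ravel()
        scores = (s1[i1][:, None] + s2[i2][None, :]).ravel()
        top = np.argsort(scores)[-self.top_k:]              # final top-k slots
        w = np.exp(scores[top] - scores[top].max())
        return cand[top], w / w.sum()                       # slots, softmax weights

    def read(self, q):
        idx, w = self._address(q)
        return w @ self.V[idx]

    def write_chunk(self, Q, T, steps=20):
        # Local chunk-level gradient descent: minimize ||read(q) - t||^2
        # over the addressed slots only (a sparse fast-weight update).
        for _ in range(steps):
            for q, t in zip(Q, T):
                idx, w = self._address(q)
                err = w @ self.V[idx] - t                   # prediction error
                self.V[idx] -= self.lr * np.outer(w, err)   # gradient step
```

A chunk of query/target pairs can then be memorized with `mem.write_chunk(Q, T)` and retrieved later with `mem.read(q)`; only the few addressed slots change per step, which is what keeps the write sparse and cheap.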
Problem

Research questions and friction points this paper is trying to address.

sequence modeling
storage capacity
computational efficiency
episodic memory
language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fast-weight Memory
Product Key Memory
Episodic Memory
Long-context Modeling
Dynamic Parameter Update