AI Summary
This work addresses the trade-off between memory capacity and computational efficiency in sequence modeling layers of language models by proposing a dynamic fast-weight Product Key Memory (PKM) mechanism. Building upon the fast weight paradigm, the method transforms the conventional static PKM into a module with dynamic contextual memory capabilities, enabling real-time parameter updates during both training and inference via localized block-wise gradient descent. This facilitates efficient writing and retrieval of new key-value pairs. By integrating a sparse PKM architecture with dynamic parameter adaptation, the approach substantially reduces perplexity on long-context tasks. Notably, it achieves strong generalization on a 128K-token "needle-in-a-haystack" retrieval task after training on only 4K tokens, thereby overcoming the limitations inherent in static memory modules.
Abstract
Sequence modeling layers in modern language models typically face a trade-off between storage capacity and computational efficiency. While Softmax attention offers unbounded storage at prohibitive quadratic costs, linear variants provide efficiency but suffer from limited, fixed-size storage. We propose Fast-weight Product Key Memory (FwPKM), a novel architecture that resolves this tension by transforming the sparse Product Key Memory (PKM) from a static module into a dynamic, "fast-weight" episodic memory. Unlike PKM, FwPKM updates its parameters dynamically at both training and inference time via local chunk-level gradient descent, allowing the model to rapidly memorize and retrieve new key-value pairs from input sequences. Experiments reveal that FwPKM functions as an effective episodic memory that complements the semantic memory of standard modules, yielding significant perplexity reductions on long-context datasets. Notably, in Needle in a Haystack evaluations, FwPKM generalizes to 128K-token contexts despite being trained on only 4K-token sequences.
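To make the core idea concrete, the following is a minimal, hypothetical sketch of a fast-weight key-value memory whose value table is updated by local, chunk-level gradient descent at inference time. All names, shapes, and hyperparameters here are illustrative assumptions; the sketch uses a dense softmax lookup as a stand-in for PKM's sparse product-key top-k retrieval, and it is not the paper's actual FwPKM implementation.

```python
import numpy as np

class FastWeightMemory:
    """Toy fast-weight memory: values are "fast weights" adapted per chunk."""

    def __init__(self, dim, num_slots, lr=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.K = rng.standard_normal((num_slots, dim)) * 0.01  # key table (slow weights)
        self.V = np.zeros((num_slots, dim))                    # value table (fast weights)
        self.lr = lr

    def _weights(self, q):
        # Softmax attention over keys; PKM would instead use a sparse
        # product-key top-k lookup here.
        scores = self.K @ q
        w = np.exp(scores - scores.max())
        return w / w.sum()

    def read(self, q):
        return self.V.T @ self._weights(q)

    def write_chunk(self, keys, values, steps=10):
        # Local gradient descent on this chunk's reconstruction error only:
        # minimize 0.5 * ||read(k) - v||^2 with respect to V.
        for _ in range(steps):
            for k, v in zip(keys, values):
                w = self._weights(k)
                err = self.read(k) - v
                # d(0.5 * ||V^T w - v||^2) / dV = outer(w, err)
                self.V -= self.lr * np.outer(w, err)
```

A usage pattern would be to call `write_chunk` on each incoming chunk of key-value pairs and `read` to retrieve them later; because the update is local to the chunk, no gradients flow through earlier context, which is what keeps the write cost bounded.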