Learning to (Learn at Test Time): RNNs with Expressive Hidden States

📅 2024-07-05
🏛️ arXiv.org
📈 Citations: 74
✨ Influential: 2
🤖 AI Summary
Traditional RNNs suffer from weak hidden-state expressivity and struggle to model long-range dependencies. Method: this paper proposes the Test-Time Training (TTT) layer, which treats the RNN hidden state as a trainable model (e.g., a linear model or a two-layer MLP) and updates it with a self-supervised gradient step per token, even during inference, so the hidden state adapts online. Crucially, this retains linear time complexity while overcoming the fixed-state limitation of conventional RNNs. Contribution/Results: experiments on models from 125M to 1.3B parameters show that TTT-Linear and TTT-MLP keep reducing perplexity on long contexts (beyond 16k tokens), outperforming Mamba in that regime and matching a strong Transformer. The results provide evidence that linear-complexity architectures can combine strong expressivity with online adaptation through this adaptive hidden-state design.

๐Ÿ“ Abstract
Self-attention performs well in long context but has quadratic complexity. Existing RNN layers have linear complexity, but their performance in long context is limited by the expressive power of their hidden states. We present a practical framework for instantiating sequence modeling layers with linear complexity and expressive hidden states. The key idea is to make the hidden state a machine learning model itself, and the update rule a step of self-supervised learning. Since the hidden state is updated by training even on test sequences, our layers are called Test-Time Training (TTT) layers. We consider two instantiations: TTT-Linear and TTT-MLP, whose hidden state is a linear model and a two-layer MLP respectively. We evaluate our instantiations at the scale of 125M to 1.3B parameters, comparing with a strong Transformer and Mamba, a modern RNN. Similar to Transformer, TTT-Linear and TTT-MLP can keep reducing perplexity by conditioning on more tokens, while Mamba cannot after 16k context. TTT-MLP still faces challenges in memory I/O, but shows larger potential in long context, pointing to a promising direction for future research.
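The key idea above (hidden state = a small model; update rule = one self-supervised gradient step per token, applied even on test sequences) can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the toy reconstruction-from-corruption objective, the dimensions, and the learning rate `eta` are all assumptions; the actual TTT layers use learned projections for the inner-loop views and targets.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16     # token dimension (illustrative)
eta = 0.1  # inner-loop learning rate (assumed value)

# Hidden state of a TTT-Linear-style layer: the weights of a linear model.
W = np.zeros((d, d))

def self_supervised_grad(W, x):
    # Toy self-supervised task (an assumption for this sketch):
    # reconstruct the token x from a noise-corrupted view x_tilde.
    x_tilde = x + 0.1 * rng.standard_normal(d)
    residual = W @ x_tilde - x           # prediction error
    grad = np.outer(residual, x_tilde)   # dL/dW for 0.5 * ||W x_tilde - x||^2
    return grad

def ttt_step(W, x, eta):
    # Update rule = one gradient-descent step on the self-supervised loss,
    # performed per token -- including on test sequences.
    W = W - eta * self_supervised_grad(W, x)
    z = W @ x                            # output uses the freshly updated state
    return W, z

tokens = rng.standard_normal((32, d))    # a toy "test" sequence
outputs = []
for x in tokens:                         # one fixed-cost update per token
    W, z = ttt_step(W, x, eta)
    outputs.append(z)
outputs = np.stack(outputs)
```

Because each token triggers exactly one fixed-size update of `W`, the cost per token is constant, which is where the linear overall complexity comes from; the state `W` carries compressed information about the whole prefix.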
Problem

Research questions and friction points this paper is trying to address.

Self-attention handles long context well but has quadratic complexity
Existing RNN layers are linear-time, but their fixed-size hidden states limit long-context performance
Gap: a sequence-modeling layer that is both linear in complexity and expressive in its hidden state
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hidden state is itself a machine learning model (a linear model for TTT-Linear, a two-layer MLP for TTT-MLP)
Test-Time Training (TTT) layers: the update rule is a step of self-supervised learning, applied per token even on test sequences
Linear complexity, with perplexity that keeps improving beyond 16k-token context