Two heads are better than one: simulating large transformers with small ones

📅 2025-06-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

The quadratic time and memory complexity $O(N^2)$ of Transformer self-attention impedes scalability for long sequences. Method: We propose a block-wise collaborative emulation framework that approximates a single long-context Transformer using $O(N/M)$ short-context Transformers operating in parallel, supported by computational modeling, block-aware architecture design, sequence partitioning, and cross-block information aggregation. Contribution/Results: We theoretically establish that, under average-case assumptions, our framework achieves optimal $O(N)$ complexity—surpassing the conventional $O((N/M)^2)$ lower bound for block-based approaches. We further identify, for the first time, the critical role of sliding-window attention and attention sink mechanisms in enabling efficient decomposition. Empirically, our method preserves functional equivalence while reducing memory and computation from $O(N^2)$ to $O(N)$, significantly improving hardware compatibility and inference efficiency across diverse long-sequence benchmarks.

Technology Category

Application Category

📝 Abstract

The quadratic complexity of self-attention prevents transformers from scaling effectively to long input sequences. On the other hand, modern GPUs and other specialized hardware accelerators are well-optimized for processing small input sequences in transformers during both training and inference. A natural question arises: can we take advantage of the efficiency of small transformers to deal with long input sequences? In this paper, we show that transformers with long input sequences (large transformers) can be efficiently simulated by transformers that can only take short input sequences (small transformers). Specifically, we prove that any transformer with input length $N$ can be efficiently simulated by only $O((N/M)^2)$ transformers with input length $M ll N$, and that this cannot be improved in the worst case. However, we then prove that in various natural scenarios including average-case inputs, sliding window masking and attention sinks, the optimal number $O(N/M)$ of small transformers suffice.

Problem

Research questions and friction points this paper is trying to address.

Simulate large transformers using small ones efficiently

Reduce quadratic complexity of self-attention in transformers

Optimize transformer performance for long input sequences

Innovation

Methods, ideas, or system contributions that make the work stand out.

Simulate large transformers with small ones

Efficiently handle long sequences via partitioning

Optimize using sliding window and attention sinks

🔎 Similar Papers

No similar papers found.

Authors to Follow