LAWCAT: Efficient Distillation from Quadratic to Linear Attention with Convolution across Tokens for Long Context Modeling

📅 2025-09-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
The quadratic computational complexity of Transformers severely hinders long-context modeling and edge deployment. To address this, we propose LAWCAT, a framework that compresses attention from quadratic to linear complexity by distilling a pre-trained transformer into a combination of token-wise causal Conv1D and normalized gated linear attention. Our key contributions are: (i) enhanced causal dependency modeling through localized convolution; (ii) a generalizable linear attention architecture, trained on only 1K-length sequences yet scaling to 22K-token context windows; and (iii) knowledge distillation combined with low-resource training, cutting data requirements to under 0.1% of the original pre-training tokens. Evaluated on Mistral-7B, LAWCAT maintains over 90% passkey retrieval accuracy at 22K context length and outperforms FlashAttention-2 in prefill latency for sequences beyond 8K tokens, improving both long-range modeling efficiency and practical deployability.
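To make the quadratic-to-linear claim concrete, here is a minimal sketch (not the paper's implementation) contrasting standard causal softmax attention, which materializes a T×T score matrix and so costs O(T²·d), with the recurrent form of linear attention, which carries a fixed-size d×d state per token and so costs O(T·d²) — linear in sequence length. Gating and normalization are omitted here for clarity.

```python
import numpy as np

def softmax_attention(q, k, v):
    # Standard causal attention: materializes a (T, T) score matrix -> O(T^2 * d).
    scores = q @ k.T / np.sqrt(q.shape[1])
    mask = np.tril(np.ones(scores.shape, dtype=bool))   # causal mask
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v

def linear_attention(q, k, v):
    # Recurrent form: a fixed-size (d, d) state updated once per token -> O(T * d^2).
    d = q.shape[1]
    S = np.zeros((d, d))            # running sum of outer products k_t v_t^T
    out = np.zeros_like(v)
    for t in range(q.shape[0]):
        S += np.outer(k[t], v[t])   # state update: constant cost per token
        out[t] = q[t] @ S           # readout: depends only on tokens <= t
    return out
```

Because the linear variant never revisits past tokens, its per-token cost is constant, which is what enables the faster prefill at long context lengths reported above.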

📝 Abstract
Although transformer architectures have achieved state-of-the-art performance across diverse domains, their quadratic computational complexity with respect to sequence length remains a significant bottleneck, particularly for latency-sensitive long-context applications. While recent linear-complexity alternatives are increasingly powerful, effectively training them from scratch is still resource-intensive. To overcome these limitations, we propose LAWCAT (Linear Attention with Convolution Across Time), a novel linearization framework designed to efficiently transfer the capabilities of pre-trained transformers into a performant linear attention architecture. LAWCAT integrates causal Conv1D layers to enhance local dependency modeling and employs normalized gated linear attention to improve generalization across varying context lengths. Our comprehensive evaluations demonstrate that distilling Mistral-7B with only 1K-length sequences yields over 90% passkey retrieval accuracy up to 22K tokens, significantly extending its effective context window. Similarly, the Llama3.2-1B LAWCAT variant achieves competitive performance on the S-NIAH 1&2&3 tasks (1K-8K context length) and the BABILong benchmark (QA2&QA3, 0K-16K context length), requiring less than 0.1% of the tokens used by models pre-trained from scratch. Furthermore, LAWCAT exhibits faster prefill speeds than FlashAttention-2 for sequences exceeding 8K tokens. LAWCAT thus provides an efficient pathway to high-performance, long-context linear models suitable for edge deployment, reducing reliance on extensive long-sequence training data and computational resources.
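The two architectural components named in the abstract can be sketched as follows. This is a simplified illustration under assumptions of our own (per-channel convolution weights, a scalar forget gate per token, and a running-key normalizer), not the paper's exact formulation: a token-wise causal Conv1D for local dependencies, followed by a gated linear attention whose output is normalized by an accumulated key state.

```python
import numpy as np

def causal_conv1d(x, w):
    # x: (T, d) token features; w: (ksz, d) per-channel filter.
    # Causal: the output at step t depends only on tokens <= t.
    T, d = x.shape
    ksz = w.shape[0]
    pad = np.vstack([np.zeros((ksz - 1, d)), x])      # left-pad with zeros
    out = np.zeros_like(x)
    for t in range(T):
        out[t] = np.sum(pad[t:t + ksz] * w, axis=0)   # depthwise causal conv
    return out

def normalized_gated_linear_attention(q, k, v, g):
    # q, k, v: (T, d); g: (T,) scalar forget gates in (0, 1) (an assumption;
    # the paper's gating may be finer-grained).
    T, d = q.shape
    S = np.zeros((d, d))   # running key-value state
    z = np.zeros(d)        # running key normalizer
    out = np.zeros_like(v)
    for t in range(T):
        S = g[t] * S + np.outer(k[t], v[t])   # gated state update
        z = g[t] * z + k[t]                   # gated normalizer update
        out[t] = (q[t] @ S) / (q[t] @ z + 1e-6)
    return out
```

In the distillation setting, layers like these would be trained to match the outputs of the teacher's softmax attention; the normalizer `z` is what helps the readout stay well-scaled as context length grows beyond the 1K training sequences.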
Problem

Research questions and friction points this paper is trying to address.

Reduces quadratic complexity of transformers for long sequences
Enables efficient knowledge transfer from pre-trained transformers
Improves linear attention generalization across varying context lengths
Innovation

Methods, ideas, or system contributions that make the work stand out.

Linear attention with convolution across tokens
Distillation from quadratic to linear attention
Normalized gated linear attention for generalization
Zeyu Liu
University of Southern California, USA
Souvik Kundu
Sr. Staff Research Scientist, Intel AI Group; Ph.D - USC; IEEE/ACM DAC under-40 Innovator
Efficient AI, Energy Efficient Computing, LLM, Multimodal Foundation Models
Lianghao Jiang
University of Southern California, USA
Anni Li
University of Southern California, USA
Srikanth Ronanki
Amazon
Speech Recognition, Natural Language Processing, Artificial Intelligence
Sravan Bodapati
Amazon AGI, USA
Gourav Datta
Assistant Professor, Case Western Reserve University
Peter A. Beerel
University of Southern California, USA