LAWCAT: Efficient Distillation from Quadratic to Linear Attention with Convolution across Tokens for Long Context Modeling

📅 2025-09-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
The quadratic computational complexity of Transformers severely hinders long-context modeling and edge deployment. To address this, we propose LAWCAT, a framework that compresses attention from quadratic to linear complexity by distilling a pre-trained transformer into a combination of token-wise causal Conv1D and normalized gated linear attention. Our key contributions are: (i) enhanced causal dependency modeling through localized convolution; (ii) a generalizable linear attention architecture, trained on only 1K-length sequences yet scaling to 22K-token context windows; and (iii) knowledge distillation combined with low-resource training, cutting data requirements to under 0.1% of the original pre-training tokens. Evaluated on Mistral-7B, LAWCAT maintains over 90% passkey retrieval accuracy at 22K context length and outperforms FlashAttention-2 in prefill latency for sequences beyond 8K tokens, improving both long-range modeling efficiency and practical deployability.
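To make the quadratic-to-linear claim concrete, here is a minimal sketch (not the paper's implementation) contrasting standard causal softmax attention, which materializes a T×T score matrix and so costs O(T²·d), with the recurrent form of linear attention, which carries a fixed-size d×d state per token and so costs O(T·d²) — linear in sequence length. Gating and normalization are omitted here for clarity.

```python
import numpy as np

def softmax_attention(q, k, v):
    # Standard causal attention: materializes a (T, T) score matrix -> O(T^2 * d).
    scores = q @ k.T / np.sqrt(q.shape[1])
    mask = np.tril(np.ones(scores.shape, dtype=bool))   # causal mask
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v

def linear_attention(q, k, v):
    # Recurrent form: a fixed-size (d, d) state updated once per token -> O(T * d^2).
    d = q.shape[1]
    S = np.zeros((d, d))            # running sum of outer products k_t v_t^T
    out = np.zeros_like(v)
    for t in range(q.shape[0]):
        S += np.outer(k[t], v[t])   # state update: constant cost per token
        out[t] = q[t] @ S           # readout: depends only on tokens <= t
    return out
```

Because the linear variant never revisits past tokens, its per-token cost is constant, which is what enables the faster prefill at long context lengths reported above.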

📝 Abstract
Although transformer architectures have achieved state-of-the-art performance across diverse domains, their quadratic computational complexity with respect to sequence length remains a significant bottleneck, particularly for latency-sensitive long-context applications. While recent linear-complexity alternatives are increasingly powerful, effectively training them from scratch is still resource-intensive. To overcome these limitations, we propose LAWCAT (Linear Attention with Convolution Across Time), a novel linearization framework designed to efficiently transfer the capabilities of pre-trained transformers into a performant linear attention architecture. LAWCAT integrates causal Conv1D layers to enhance local dependency modeling and employs normalized gated linear attention to improve generalization across varying context lengths. Our comprehensive evaluations demonstrate that distilling Mistral-7B with only 1K-length sequences yields over 90% passkey retrieval accuracy up to 22K tokens, significantly extending its effective context window. Similarly, the Llama3.2-1B LAWCAT variant achieves competitive performance on the S-NIAH 1&2&3 tasks (1K-8K context length) and the BABILong benchmark (QA2&QA3, 0K-16K context length), requiring less than 0.1% of the tokens used by models pre-trained from scratch. Furthermore, LAWCAT exhibits faster prefill speeds than FlashAttention-2 for sequences exceeding 8K tokens. LAWCAT thus provides an efficient pathway to high-performance, long-context linear models suitable for edge deployment, reducing reliance on extensive long-sequence training data and computational resources.
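The two architectural components named in the abstract can be sketched as follows. This is a simplified illustration under assumptions of our own (per-channel convolution weights, a scalar forget gate per token, and a running-key normalizer), not the paper's exact formulation: a token-wise causal Conv1D for local dependencies, followed by a gated linear attention whose output is normalized by an accumulated key state.

```python
import numpy as np

def causal_conv1d(x, w):
    # x: (T, d) token features; w: (ksz, d) per-channel filter.
    # Causal: the output at step t depends only on tokens <= t.
    T, d = x.shape
    ksz = w.shape[0]
    pad = np.vstack([np.zeros((ksz - 1, d)), x])      # left-pad with zeros
    out = np.zeros_like(x)
    for t in range(T):
        out[t] = np.sum(pad[t:t + ksz] * w, axis=0)   # depthwise causal conv
    return out

def normalized_gated_linear_attention(q, k, v, g):
    # q, k, v: (T, d); g: (T,) scalar forget gates in (0, 1) (an assumption;
    # the paper's gating may be finer-grained).
    T, d = q.shape
    S = np.zeros((d, d))   # running key-value state
    z = np.zeros(d)        # running key normalizer
    out = np.zeros_like(v)
    for t in range(T):
        S = g[t] * S + np.outer(k[t], v[t])   # gated state update
        z = g[t] * z + k[t]                   # gated normalizer update
        out[t] = (q[t] @ S) / (q[t] @ z + 1e-6)
    return out
```

In the distillation setting, layers like these would be trained to match the outputs of the teacher's softmax attention; the normalizer `z` is what helps the readout stay well-scaled as context length grows beyond the 1K training sequences.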
Problem

Research questions and friction points this paper is trying to address.

Reduces quadratic complexity of transformers for long sequences
Enables efficient knowledge transfer from pre-trained transformers
Improves linear attention generalization across varying context lengths
Innovation

Methods, ideas, or system contributions that make the work stand out.

Linear attention with convolution across tokens
Distillation from quadratic to linear attention
Normalized gated linear attention for generalization
Zeyu Liu
University of Southern California, USA
Souvik Kundu
Sr. Staff Research Scientist, Intel AI Group; Ph.D - USC; IEEE/ACM DAC under-40 Innovator
Efficient AI, Energy Efficient Computing, LLM, Multimodal Foundation Models
Lianghao Jiang
University of Southern California, USA
Anni Li
University of Southern California, USA
Srikanth Ronanki
Amazon
Speech Recognition, Natural Language Processing, Artificial Intelligence
Sravan Bodapati
Amazon AGI, USA
Gourav Datta
Assistant Professor, Case Western Reserve University
Peter A. Beerel
University of Southern California, USA