Optimizing Native Sparse Attention with Latent Attention and Local Global Alternating Strategies

📅 2025-11-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limitation of Native Sparse Attention (NSA) in capturing long-range dependencies for long-context modeling, this paper proposes a dynamic alternating local-global attention mechanism, coupled with an inter-layer switching architecture that synergistically integrates Multi-Head Latent Attention (MLA) and Group-Head Latent Attention (GLA). The method jointly leverages sliding-window attention, key-value (KV) compression, and selective attention to enhance cross-region information propagation while preserving sparsity. Experiments on models ranging from 340M to 1.3B parameters demonstrate that our approach matches or surpasses full attention baselines on commonsense reasoning and long-text understanding tasks. Moreover, it reduces KV cache memory consumption by up to 50%, significantly improving both efficiency and effectiveness in long-sequence modeling.

📝 Abstract
In this work, we conduct a systematic analysis of Native Sparse Attention (NSA) and propose targeted improvements that enhance long-context modeling. A key insight is that alternating between local (sliding-window) and global (compression, selective) attention across layers, rather than using fixed patterns, enables more effective propagation of long-range dependencies and substantially boosts performance on long-sequence tasks. We further refine NSA's branches with Latent Attention: the sliding-window branch is enhanced with Multi-head Latent Attention (MLA), while the compression and selective branches adopt Group-head Latent Attention (GLA). These changes reduce KV-cache memory by 50% versus NSA while improving the model's commonsense reasoning and long-text understanding capabilities. Experiments on models from 340M to 1.3B parameters (trained on 15B and 100B tokens) show that our method matches or exceeds both full attention and Native Sparse Attention on commonsense reasoning and long-context understanding tasks.
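The abstract's core idea, alternating local and global attention across layers rather than applying the same branch mix at every layer, can be sketched as a simple layer schedule. This is an illustrative assumption of how such a schedule might be expressed; the function and branch names (`build_schedule`, `"local"`, `"global"`) are hypothetical and not taken from the paper.

```python
def build_schedule(num_layers: int) -> list[str]:
    """Hypothetical alternating layer schedule: even layers use the local
    (sliding-window) branch, odd layers use the global (compression +
    selective) branches, so long-range information can propagate across
    regions every other layer instead of being confined by a fixed pattern."""
    return ["local" if i % 2 == 0 else "global" for i in range(num_layers)]


# Example: a 6-layer model alternates local and global attention.
print(build_schedule(6))  # ['local', 'global', 'local', 'global', 'local', 'global']
```

Under this sketch, each "local" layer would be served by MLA and each "global" layer by GLA, which is where the KV-cache savings reported in the abstract would come from.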
Problem

Research questions and friction points this paper is trying to address.

Optimizing sparse attention for long-context modeling
Capturing long-range dependencies under fixed sparse attention patterns
Reducing KV-cache memory while improving reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Alternating local and global attention across layers
Enhanced sliding-window with Multi-head Latent Attention
Compression and selective branches use Group-head Latent Attention
Yuxuan Hu
School of Information, Renmin University of China, Beijing, China
Jianchao Tan
Meituan
Jiaqi Zhang
Meituan, Beijing, China
Wen Zan
Meituan, Beijing, China
Pingwei Sun
Meituan, Beijing, China
Yifan Lu
Meituan, Beijing, China
Yerui Sun
Meituan, Beijing, China
Yuchen Xie
Meituan, Beijing, China
Xunliang Cai
Meituan, Beijing, China
Jing Zhang
School of Information, Renmin University of China, Beijing, China