Punctuation-aware Hybrid Trainable Sparse Attention for Large Language Models

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Dense attention mechanisms struggle to scale to long sequences due to their quadratic complexity, while existing sparse approaches often rely on coarse-grained block partitioning that blurs semantic boundaries and discards critical information. This work proposes PHSA, a trainable sparse attention framework that, for the first time, explicitly leverages punctuation marks as semantic-boundary anchors. PHSA introduces a punctuation-aware dual-branch aggregation mechanism that integrates global semantics with boundary-specific features, and employs an extreme-sparsity-adaptive training strategy, enabling high-fidelity long-context modeling with nearly zero additional computational overhead. Evaluated at a 32k input length and 97.3% sparsity, a 0.6B-parameter model using PHSA reduces information loss by 10.8% and outperforms both dense attention and state-of-the-art sparse methods, including InfLLM v2, across general and long-context benchmarks.

📝 Abstract
Attention serves as the fundamental mechanism for long-context modeling in large language models (LLMs), yet dense attention becomes structurally prohibitive for long sequences due to its quadratic complexity. Consequently, sparse attention has attracted growing interest as a scalable alternative. However, existing sparse attention methods rely on coarse-grained semantic representations during block selection, which blur intra-block semantic boundaries and lead to the loss of critical information. To address this issue, we propose Punctuation-aware Hybrid Sparse Attention (PHSA), a natively trainable sparse attention framework that leverages punctuation tokens as semantic boundary anchors. Specifically, (1) we design a dual-branch aggregation mechanism that fuses global semantic representations with punctuation-enhanced boundary features, preserving the core semantic structure while introducing almost no additional computational overhead; and (2) we introduce an extreme-sparsity-adaptive training and inference strategy that stabilizes model behavior under very low token activation ratios. Extensive experiments on general benchmarks and long-context evaluations demonstrate that PHSA consistently outperforms dense attention and state-of-the-art sparse attention baselines, including InfLLM v2. Specifically, for a 0.6B-parameter model with 32k-token input sequences, PHSA reduces information loss by 10.8% at a sparsity ratio of 97.3%.
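The dual-branch block selection described in the abstract can be sketched as follows. This is a minimal illustration under assumptions, not the paper's implementation: the function name `phsa_block_select`, the use of mean pooling for both branches, and the fusion weight `alpha` are all hypothetical; the paper's actual aggregation and scoring may differ.

```python
import numpy as np

def phsa_block_select(keys, punct_mask, query, block_size=64, top_k=4, alpha=0.5):
    """Hypothetical sketch of punctuation-aware block selection.

    Each key block gets a representation that fuses a global branch
    (mean over all tokens in the block) with a boundary branch (mean
    over the block's punctuation tokens only). Blocks are scored
    against the query, and the top-k blocks are kept for attention,
    so only a small fraction of tokens is ever attended to.
    """
    n, _ = keys.shape
    n_blocks = (n + block_size - 1) // block_size
    scores = np.empty(n_blocks)
    for b in range(n_blocks):
        blk = keys[b * block_size:(b + 1) * block_size]
        pm = punct_mask[b * block_size:(b + 1) * block_size]
        global_repr = blk.mean(axis=0)                 # global branch
        # boundary branch; falls back to the global branch if the
        # block contains no punctuation tokens
        boundary_repr = blk[pm].mean(axis=0) if pm.any() else global_repr
        fused = alpha * global_repr + (1 - alpha) * boundary_repr
        scores[b] = query @ fused                      # relevance score
    return np.argsort(scores)[::-1][:top_k]            # selected block ids
```

With `block_size=64` and `top_k=2` over a 256-token sequence, only 2 of 4 blocks survive selection; the boundary branch adds one masked mean per block, which is the "almost no additional computational overhead" claim in miniature.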
Problem

Research questions and friction points this paper is trying to address.

sparse attention, semantic boundaries, long-context modeling, punctuation-aware, information loss
Innovation

Methods, ideas, or system contributions that make the work stand out.

sparse attention, punctuation-aware, long-context modeling, trainable attention, hybrid attention
👥 Authors
Junxiang Qiu (University of Science and Technology of China)
Shuo Wang (University of Science and Technology of China)
Zhengsu Chen (Huawei Inc.)
Hengheng Zhang (Huawei Inc.)
Jinda Lu (University of Science and Technology of China)
Changcheng Li (University of Science and Technology of China)
Qi Tian (Huawei Inc.)