Punctuation-aware Hybrid Trainable Sparse Attention for Large Language Models

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Dense attention mechanisms struggle to scale to long sequences due to their quadratic complexity, while existing sparse approaches often rely on coarse-grained block partitioning that blurs semantic boundaries and discards critical information. This work proposes PHSA, a trainable sparse attention framework that, for the first time, explicitly leverages punctuation marks as semantic-boundary anchors. PHSA introduces a punctuation-aware dual-branch aggregation mechanism that integrates global semantics with boundary-specific features, and employs an extreme-sparsity-adaptive training strategy, enabling high-fidelity long-context modeling with nearly zero additional computational overhead. Evaluated at a 32k input length and 97.3% sparsity, a 0.6B-parameter model using PHSA reduces information loss by 10.8% and outperforms both dense attention and state-of-the-art sparse methods, including InfLLM v2, across general and long-context benchmarks.

📝 Abstract
Attention serves as the fundamental mechanism for long-context modeling in large language models (LLMs), yet dense attention becomes structurally prohibitive for long sequences due to its quadratic complexity. Consequently, sparse attention has attracted growing interest as a scalable alternative. However, existing sparse attention methods rely on coarse-grained semantic representations during block selection, which blur intra-block semantic boundaries and lead to the loss of critical information. To address this issue, we propose Punctuation-aware Hybrid Sparse Attention (PHSA), a natively trainable sparse attention framework that leverages punctuation tokens as semantic boundary anchors. Specifically, (1) we design a dual-branch aggregation mechanism that fuses global semantic representations with punctuation-enhanced boundary features, preserving the core semantic structure while introducing almost no additional computational overhead; and (2) we introduce an extreme-sparsity-adaptive training and inference strategy that stabilizes model behavior under very low token activation ratios. Extensive experiments on general benchmarks and long-context evaluations demonstrate that PHSA consistently outperforms dense attention and state-of-the-art sparse attention baselines, including InfLLM v2. Specifically, for a 0.6B-parameter model with 32k-token input sequences, PHSA reduces information loss by 10.8% at a sparsity ratio of 97.3%.
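The dual-branch block selection described in the abstract can be sketched as follows. This is a minimal illustration under assumptions, not the paper's implementation: the function name `phsa_block_select`, the use of mean pooling for both branches, and the fusion weight `alpha` are all hypothetical; the paper's actual aggregation and scoring may differ.

```python
import numpy as np

def phsa_block_select(keys, punct_mask, query, block_size=64, top_k=4, alpha=0.5):
    """Hypothetical sketch of punctuation-aware block selection.

    Each key block gets a representation that fuses a global branch
    (mean over all tokens in the block) with a boundary branch (mean
    over the block's punctuation tokens only). Blocks are scored
    against the query, and the top-k blocks are kept for attention,
    so only a small fraction of tokens is ever attended to.
    """
    n, _ = keys.shape
    n_blocks = (n + block_size - 1) // block_size
    scores = np.empty(n_blocks)
    for b in range(n_blocks):
        blk = keys[b * block_size:(b + 1) * block_size]
        pm = punct_mask[b * block_size:(b + 1) * block_size]
        global_repr = blk.mean(axis=0)                 # global branch
        # boundary branch; falls back to the global branch if the
        # block contains no punctuation tokens
        boundary_repr = blk[pm].mean(axis=0) if pm.any() else global_repr
        fused = alpha * global_repr + (1 - alpha) * boundary_repr
        scores[b] = query @ fused                      # relevance score
    return np.argsort(scores)[::-1][:top_k]            # selected block ids
```

With `block_size=64` and `top_k=2` over a 256-token sequence, only 2 of 4 blocks survive selection; the boundary branch adds one masked mean per block, which is the "almost no additional computational overhead" claim in miniature.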
Problem

Research questions and friction points this paper is trying to address.

sparse attention, semantic boundaries, long-context modeling, punctuation-aware, information loss
Innovation

Methods, ideas, or system contributions that make the work stand out.

sparse attention, punctuation-aware, long-context modeling, trainable attention, hybrid attention
👥 Authors
Junxiang Qiu (University of Science and Technology of China)
Shuo Wang (University of Science and Technology of China)
Zhengsu Chen (Huawei Inc.)
Hengheng Zhang (Huawei Inc.)
Jinda Lu (University of Science and Technology of China)
Changcheng Li (University of Science and Technology of China)
Qi Tian (Huawei Inc.)