MAR: Efficient Large Language Models via Module-aware Architecture Refinement

📅 2026-01-29
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the high computational cost and energy consumption of large language models, primarily caused by the quadratic complexity of attention mechanisms and dense feedforward networks. To mitigate these issues, the authors propose the Module-aware Architecture Refinement (MAR) framework, which first replaces attention with State Space Models (SSMs) to achieve linear-complexity sequence modeling and then reduces feedforward computation via activation sparsification. Furthermore, they introduce the Adaptive Ternary Multi-step Neuron (ATMN) and the Spike-aware Bidirectional Distillation Strategy (SBDS) to alleviate the low information density and temporal misalignment that arise when integrating SSMs with spiking neural networks. Experiments demonstrate that MAR substantially reduces inference energy consumption while recovering the performance of its dense counterpart, outperforming efficient architectures of comparable and even larger scale.

๐Ÿ“ Abstract
Large Language Models (LLMs) excel across diverse domains but suffer from high energy costs due to quadratic attention and dense Feed-Forward Network (FFN) operations. To address these issues, we propose Module-aware Architecture Refinement (MAR), a two-stage framework that integrates State Space Models (SSMs) for linear-time sequence modeling and applies activation sparsification to reduce FFN costs. In addition, to mitigate low information density and temporal mismatch in integrating Spiking Neural Networks (SNNs) with SSMs, we design the Adaptive Ternary Multi-step Neuron (ATMN) and the Spike-aware Bidirectional Distillation Strategy (SBDS). Extensive experiments demonstrate that MAR effectively restores the performance of its dense counterpart under constrained resources while substantially reducing inference energy consumption. Furthermore, it outperforms efficient models of comparable or even larger scale, underscoring its potential for building efficient and practical LLMs.
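The abstract's efficiency argument rests on swapping quadratic self-attention for a linear-time state-space recurrence. The paper's exact SSM parameterization is not reproduced on this page; the sketch below shows only the generic recurrence h_t = A·h_{t−1} + B·x_t, y_t = C·h_t (matrix names and the toy values are illustrative, not taken from MAR):

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Generic state-space recurrence: h_t = A @ h_{t-1} + B @ x_t, y_t = C @ h_t.

    One state update per token, so the cost is O(T) in sequence length T,
    versus O(T^2) for pairwise attention scores.
    """
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:              # single pass over the sequence
        h = A @ h + B @ x_t
        ys.append(C @ h)
    return np.stack(ys)

# toy example: 2-dim state, identity input/output maps, decaying transition
A = 0.5 * np.eye(2)
B = np.eye(2)
C = np.eye(2)
x = np.ones((4, 2))            # 4 tokens, 2 features each
y = ssm_scan(x, A, B, C)
```

Because the state h is a fixed-size summary of the prefix, inference also needs only O(1) memory per step, which is the property that makes SSM blocks attractive as attention replacements.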
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
energy efficiency
quadratic attention
dense Feed-Forward Network
inference cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

Module-aware Architecture Refinement
State Space Models
Spiking Neural Networks
Activation Sparsification
Energy-efficient LLMs
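Two of the keywords above, activation sparsification and the spiking-neuron component, are described only briefly in the abstract. Their exact formulations are not given on this page; the following is a minimal illustrative sketch in which the top-k rule and the symmetric threshold are assumptions, not the paper's ATMN definition:

```python
import numpy as np

def topk_sparsify(a, k):
    """Keep only the k largest-magnitude activations; zero the rest.

    A generic stand-in for FFN activation sparsification: fewer nonzero
    activations means fewer multiply-accumulates in the next layer.
    """
    mask = np.zeros_like(a)
    mask[np.argsort(np.abs(a))[-k:]] = 1.0
    return a * mask

def ternary_spike(u, theta):
    """Map membrane potentials to {-1, 0, +1} spikes via a symmetric
    threshold theta -- a generic ternary neuron, not the paper's ATMN."""
    return np.where(u > theta, 1.0, np.where(u < -theta, -1.0, 0.0))

a = np.array([0.1, -2.0, 0.5, 0.05])
sparse_a = topk_sparsify(a, k=2)
spikes = ternary_spike(np.array([0.8, -0.9, 0.1]), theta=0.5)
```

Ternary {−1, 0, +1} outputs carry more information per step than binary spikes while still replacing multiplications with sign-dependent additions, which is the usual motivation for combining spiking activations with energy-constrained inference.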
Junhong Cai
Department of CSE, Southern University of Science and Technology, Shenzhen, China
Guiqin Wang
School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, China; National Engineering Laboratory for Big Data Analytics, Xi'an Jiaotong University, Xi'an, China
Kejie Zhao
Department of CSE, Southern University of Science and Technology, Shenzhen, China
Jianxiong Tang
Department of Computer Science, City University of Hong Kong, Hong Kong, China
Xiang Wang
ACS Laboratory, Huawei Technologies Co., Ltd., Shenzhen, China
Luziwei Leng
ACS Laboratory, Huawei Technologies Co., Ltd., Shenzhen, China
Ran Cheng
Department of Data Science and Artificial Intelligence, Department of Computing, Hong Kong Polytechnic University, Hong Kong, China
Yuxin Ma
Southern University of Science and Technology
Information Visualization; Visual Analytics; Human-Computer Interaction
Qinghai Guo
ACS Laboratory, Huawei Technologies Co., Ltd., Shenzhen, China