VersatileFFN: Achieving Parameter Efficiency in LLMs via Adaptive Wide-and-Deep Reuse

📅 2025-12-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the memory bottleneck imposed by parameter explosion in large language models (LLMs), this paper proposes an adaptive wide-and-deep parameter-reuse feed-forward network (FFN) architecture. The method dynamically allocates token-level computational resources under a fixed parameter budget, using a dual-path parameter reuse mechanism that operates simultaneously across width and depth, augmented by cognition-inspired difficulty-aware gating. Crucially, all additional capacity incurs only computational overhead, with zero incremental memory cost. Technically, it combines simulated sparse expert routing, a recursive FFN design, and joint parameter sharing across both paths. Extensive experiments across multiple model scales and diverse benchmarks demonstrate consistent gains over state-of-the-art parameter-efficient methods: the approach improves inference quality at equivalent parameter counts and provides a systematic empirical validation of the "computation-for-memory" paradigm.

📝 Abstract
The rapid scaling of Large Language Models (LLMs) has achieved remarkable performance, but it also leads to prohibitive memory costs. Existing parameter-efficient approaches such as pruning and quantization mainly compress pretrained models without enhancing architectural capacity, thereby hitting the representational ceiling of the base model. In this work, we propose VersatileFFN, a novel feed-forward network (FFN) that enables flexible reuse of parameters in both width and depth dimensions within a fixed parameter budget. Inspired by the dual-process theory of cognition, VersatileFFN comprises two adaptive pathways: a width-versatile path that generates a mixture of sub-experts from a single shared FFN, mimicking sparse expert routing without increasing parameters, and a depth-versatile path that recursively applies the same FFN to emulate deeper processing for complex tokens. A difficulty-aware gating dynamically balances the two pathways, steering "easy" tokens through the efficient width-wise route and allocating deeper iterative refinement to "hard" tokens. Crucially, both pathways reuse the same parameters, so all additional capacity comes from computation rather than memory. Experiments across diverse benchmarks and model scales demonstrate the effectiveness of the method. The code will be available at https://github.com/huawei-noah/noah-research/tree/master/VersatileFFN.
Problem

Research questions and friction points this paper is trying to address.

Enhance LLM architectural capacity within fixed parameter budget
Enable flexible parameter reuse in width and depth dimensions
Balance computational efficiency and processing depth for different token complexities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive wide-and-deep parameter reuse within fixed budget
Width-versatile path creates sub-experts from shared FFN
Depth-versatile path recursively processes complex tokens
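The two pathways and the difficulty-aware gate described above can be sketched as follows. This is a minimal NumPy illustration under assumed details: the hidden-unit masking used to carve sub-experts, the fixed recursion count, and the sigmoid gating form are plausible guesses for illustration, not the authors' implementation (see their repository for the real code).

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class VersatileFFNSketch:
    """Hypothetical sketch: one shared FFN reused along width and depth."""

    def __init__(self, d_model=8, d_hidden=16, n_experts=4, n_steps=2, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 0.02, (d_model, d_hidden))       # shared up-projection
        self.W2 = rng.normal(0, 0.02, (d_hidden, d_model))       # shared down-projection
        self.router = rng.normal(0, 0.02, (d_model, n_experts))  # width-path routing
        self.gate = rng.normal(0, 0.02, (d_model, 1))            # difficulty gate
        self.n_experts = n_experts
        self.n_steps = n_steps

    def shared_ffn(self, x, mask=None):
        h = gelu(x @ self.W1)
        if mask is not None:                 # keep only this sub-expert's hidden units
            h = h * mask
        return h @ self.W2

    def width_path(self, x):
        # Carve the shared hidden layer into n_experts slices and mix them
        # with router probabilities -- sparse expert routing with no new weights.
        probs = softmax(x @ self.router)     # (tokens, n_experts)
        slices = np.split(np.eye(self.W1.shape[1]), self.n_experts, axis=0)
        out = np.zeros_like(x)
        for e in range(self.n_experts):
            mask = slices[e].sum(axis=0)     # 0/1 mask over hidden units
            out += probs[:, e:e + 1] * self.shared_ffn(x, mask)
        return out

    def depth_path(self, x):
        # Recursively apply the same FFN: weight tying across "virtual" layers.
        h = x
        for _ in range(self.n_steps):
            h = h + self.shared_ffn(h)       # residual iterative refinement
        return h - x                         # return only the FFN contribution

    def forward(self, x):
        # Per-token difficulty score steers easy tokens to the width path
        # and hard tokens to the deeper recursive path.
        g = 1.0 / (1.0 + np.exp(-(x @ self.gate)))
        return (1 - g) * self.width_path(x) + g * self.depth_path(x)

x = np.random.default_rng(1).normal(size=(3, 8))   # 3 tokens, d_model = 8
y = VersatileFFNSketch().forward(x)
print(y.shape)
```

Note that `W1` and `W2` are the only FFN weights: both paths read from them, so the extra capacity costs compute, not memory, which is the paper's central trade.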
Ying Nie
Huawei Noah’s Ark Lab
Kai Han
Huawei Noah’s Ark Lab
Hongguang Li
Shanghai Jiao Tong University
Nonlinear Vibrations · Nonlinear Dynamics · Signal Processing
Hang Zhou
Huawei Noah’s Ark Lab
Tianyu Guo
Huawei Noah’s Ark Lab
Enhua Wu
ISCAS, University of Macau
Xinghao Chen
Huawei Noah’s Ark Lab
Yunhe Wang
Noah's Ark Lab, Huawei Technologies
Deep Learning · Language Model · Machine Learning · Computer Vision