🤖 AI Summary
This work addresses a limitation of conventional SwiGLU in Mixture-of-Experts (MoE) models: its gate sharpness is fixed, so the expert MLP cannot trade off smooth against selective gating according to routing confidence. The authors propose κ-SwiGLU, a gating mechanism that models the sharpness coefficient of the SiLU gate as a learnable function of the router logit, allowing each expert’s gate units to interpolate between smooth, broadly active gating and sharp, selective gating based on token-level routing confidence. Evaluated on the FineWeb-Edu dataset, κ-SwiGLU improves mean CORE performance across MoE Transformer models with 8 to 28 layers, while adding negligible parameters and only a small computational overhead.
📝 Abstract
SwiGLU has become a standard gated activation in modern Transformer MLPs, yet its gate sharpness -- the smoothness and selectivity of the gating function -- is typically fixed throughout training. In this work, we propose Confidence-Aware SwiGLU ($κ$-SwiGLU), a variant of SwiGLU for Mixture-of-Experts (MoE) models that adjusts expert gate sharpness according to token-level routing confidence. Specifically, $κ$-SwiGLU parameterizes the SiLU gate sharpness coefficient as a learnable function of the router logit, enabling each expert gate unit to interpolate between smooth, broadly active gating and sharp, selective gating. We evaluate $κ$-SwiGLU on the FineWeb-Edu dataset across MoE Transformer models ranging from 8 to 28 layers. Across these settings, $κ$-SwiGLU improves mean CORE performance while adding negligible parameters and incurring only a small computational overhead, demonstrating that confidence-aware gate sharpness is a promising mechanism for improving MoE MLPs. The code is available at https://github.com/askerlee/kappa-swiglu.
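The mechanism described above can be illustrated with a minimal pure-Python sketch. The abstract does not give the exact parameterization, so everything specific here is an assumption made for illustration: the sharpness is taken to be `softplus(a * router_logit + b)` with hypothetical per-unit parameters `a` and `b`, and the function names (`sharpness`, `kappa_silu`, `kappa_swiglu_expert`) are invented for this example rather than taken from the released code.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softplus(z):
    # Numerically stable log(1 + exp(z)); always positive.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def sharpness(router_logit, a, b):
    # ASSUMPTION: kappa = softplus(a * logit + b). The paper only states that
    # the sharpness is a learnable function of the router logit; this affine
    # plus softplus form is a hypothetical choice that keeps kappa positive.
    return softplus(a * router_logit + b)


def kappa_silu(x, kappa):
    # SiLU with a sharpness coefficient: x * sigmoid(kappa * x).
    # kappa = 1 is standard SiLU, large kappa approaches ReLU,
    # and kappa = 0 gives the linear map x / 2.
    return x * sigmoid(kappa * x)


def kappa_swiglu_expert(x, w_gate, w_up, w_down, router_logit, a, b):
    # One expert MLP: down(kappa_silu(gate(x)) * up(x)), where each hidden
    # unit i has its own (hypothetical) sharpness parameters a[i], b[i].
    hidden = []
    for i, (g_row, u_row) in enumerate(zip(w_gate, w_up)):
        kappa = sharpness(router_logit, a[i], b[i])
        hidden.append(kappa_silu(dot(g_row, x), kappa) * dot(u_row, x))
    return [dot(row, hidden) for row in w_down]


# Usage: with a = 0 and b = log(e - 1), kappa is exactly 1 for every token,
# so the sketch reduces to a standard SwiGLU expert.
B_UNIT = math.log(math.e - 1.0)
x = [1.0, -2.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
out = kappa_swiglu_expert(
    x, identity, identity, [[1.0, 1.0]],
    router_logit=5.0, a=[0.0, 0.0], b=[B_UNIT, B_UNIT],
)
```

Under this assumed form, a positive `a[i]` makes the gate of unit `i` sharper for tokens routed with a high logit and smoother for tokens routed with a low one, at a cost of two extra scalars per hidden unit. Whether the actual method uses per-unit, per-expert, or shared parameters is not stated in the abstract.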