🤖 AI Summary
Existing low-rank adapters suffer from limited expressivity due to rigid, fixed-rank constraints, hindering their effectiveness on complex tasks. To address this, we propose MoKA—a parameter-efficient fine-tuning method based on *mixture of Kronecker products*. MoKA employs a learnable gating mechanism to dynamically weigh and combine Kronecker factors, enabling fine-grained, task-adaptive rank allocation. This design preserves extreme parameter efficiency—reducing trainable parameters by up to 27×—while substantially enhancing modeling capacity. Crucially, MoKA relies solely on standard matrix operations, ensuring native compatibility with GPU acceleration and low-bit quantization. Extensive experiments on quantized LLaMA2-7B and LLaMA3-8B models demonstrate that MoKA consistently outperforms state-of-the-art PEFT methods in both accuracy and efficiency, achieving a new SOTA trade-off.
📝 Abstract
Parameter-efficient fine-tuning (PEFT) is essential for reducing the computational overhead of large language models (LLMs). Adapters from the low-rank family are commonly used to control the parameter count efficiently while maintaining the generative power of LLMs. However, their limited expressiveness due to the rank constraint often restricts their performance on complex tasks. We propose Mixture of Kronecker Adapters (MoKA), a new generation of Kronecker adapters that addresses this limitation by modeling weight updates as a mixture of Kronecker products. Our proposed adapter leverages a gating mechanism that measures the importance of each Kronecker factor, enabling more expressive adaptation. Moreover, MoKA offers rank flexibility, providing a better trade-off between parameter efficiency and accuracy. To ensure hardware efficiency, we reformulate Kronecker computations using standard matrix operations, allowing seamless deployment on GPU-optimized hardware. We conduct extensive experiments on instruction-tuning and commonsense reasoning tasks using low-bit quantized versions of the LLaMA2-7B and LLaMA3-8B models. MoKA not only outperforms PEFT baselines, but also reduces the number of trainable parameters by up to 27×, achieving state-of-the-art trade-offs between performance and parameter efficiency.
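To make the core idea concrete, here is a minimal NumPy sketch of a gated mixture-of-Kronecker-products update applied to an input vector. It uses only standard matrix multiplications via the identity (A ⊗ B) vec(X) = vec(B X Aᵀ) with column-major vectorization, so the large Kronecker products are never materialized. This is an illustrative reading of the abstract, not the paper's actual implementation; the function name, shapes, and gating values are assumptions.

```python
import numpy as np

def kron_mixture_matvec(x, factors, gates):
    """Compute y = (sum_i g_i * (A_i kron B_i)) @ x without forming any
    Kronecker product, using (A kron B) vec(X) = vec(B X A^T) with
    column-major (Fortran-order) vectorization.

    Hypothetical sketch of a MoKA-style weight update (not the paper's code).
    x       : input vector of length p*q
    factors : list of (A_i, B_i) pairs, A_i is m x p, B_i is n x q
    gates   : per-factor mixture weights g_i (e.g. softmax outputs)
    """
    y = None
    for (A, B), g in zip(factors, gates):
        p = A.shape[1]
        q = B.shape[1]
        # Reshape x into the q x p matrix X such that vec(X) = x (column-major).
        X = x.reshape(q, p, order="F")
        # B @ X @ A.T is an n x m matrix; its column-major vec is (A kron B) @ x.
        out = (B @ X @ A.T).reshape(-1, order="F")
        y = g * out if y is None else y + g * out
    return y
```

Each term costs O(nqp + nmp) flops instead of the O(mnpq) of a dense matvec against the materialized Kronecker product, which is what makes the reformulation GPU-friendly.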