LoRA-Muon: Spectral Steepest Descent on the Low-Rank Manifold

📅 2026-06-11

🤖 AI Summary

This work addresses the difficulty of tuning LoRA fine-tuning under factor-wise optimizers such as AdamW: it is sensitive to initialization, its optimal learning rates transfer poorly across ranks, and it often fails to beat dense training baselines. The authors apply the spectral steepest-descent rule of the Muon optimizer to the low-rank setting and pair it with a split weight-decay rule, yielding LoRA-Muon, which they present as a low-rank proxy for full-rank Muon and Shampoo-family optimizers. The method computes its update without QR decomposition and without storing second moments, which makes it more accelerator-friendly and memory-efficient. Its optimal learning rates transfer across rank, width, depth, and factor rescaling. In a compute-matched TinyShakespeare study, a rank-2 proxy recovers the best tested dense learning rate, and a rank-32 run attains lower mean validation loss than the dense baseline in the seed-averaged sweep.

📝 Abstract

Low-Rank Adaptation (LoRA) significantly reduces compute and memory costs for finetuning deep learning models but is often harder to tune than dense training: when using factor-wise optimizers such as AdamW, it is sensitive to initialization choices, its optimal learning rates transfer poorly across ranks, and it often fails to beat dense baselines. We derive LoRA-Muon by applying the Muon optimizer's spectral steepest-descent rule to the low-rank setting. Our main claim is that LoRA-Muon, together with our split weight-decay rule, is a good low-rank proxy for full-rank Muon and Shampoo-family optimizers. Its optimal learning rates transfer across rank, width, depth, and factor rescaling. In our compute-matched TinyShakespeare study, a rank-$2$ proxy recovers the best tested dense learning rate, and a rank-$32$ LoRA-Muon run attains lower mean validation loss than the dense baseline in the seed-averaged sweep. We further show that the Spectron optimizer depends on arbitrary factor scaling, so it would likely be a poor fit when finetuning starts from badly imbalanced factors, and that LoRA-RITE's simplified QR-coordinate core implements the same spectral update. LoRA-Muon computes that update without QR decomposition and avoids storing second moments, making it more accelerator-friendly and memory-efficient.
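
For readers unfamiliar with the spectral steepest-descent rule mentioned in the abstract, the sketch below illustrates the dense version of the idea only: a gradient matrix with SVD $G = U S V^\top$ is replaced by its polar factor $U V^\top$, so every singular value of the update direction becomes 1. This is not the paper's LoRA-Muon update, which the abstract does not spell out. The helper names (`matmul`, `transpose`, `orthogonalize`) are hypothetical, and the classic cubic Newton-Schulz iteration is used here as a stand-in for whatever orthogonalization routine a real implementation would choose.

```python
# Illustrative sketch only: dense spectral steepest-descent direction via the
# classic cubic Newton-Schulz iteration. Pure Python, no dependencies.
# All function names are hypothetical, not taken from the paper.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def orthogonalize(g, steps=30):
    """Approximate the polar factor U V^T of g = U S V^T (g assumed full rank)."""
    # Dividing by the Frobenius norm puts all singular values in (0, 1],
    # which is inside the convergence region of the cubic iteration.
    fro = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        # X <- 1.5 X - 0.5 X X^T X pushes each singular value toward 1.
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * p - 0.5 * q for p, q in zip(rx, rq)] for rx, rq in zip(x, xxtx)]
    return x

grad = [[3.0, 1.0], [1.0, 2.0], [0.0, 1.0]]      # toy 3x2 "gradient"
direction = orthogonalize(grad)                  # approximately U V^T
gram = matmul(transpose(direction), direction)   # close to the 2x2 identity
```

The Gram matrix `gram` being close to the identity is what "all singular values equal to 1" means for a tall matrix. In the dense setting a step would then take the form `W <- W - lr * direction`; how the analogous step is distributed over the LoRA factors is the subject of the paper itself.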

Problem

Research questions and friction points this paper is trying to address.

Low-Rank Adaptation

fine-tuning

optimization

learning rate transfer

rank sensitivity

Innovation

Methods, ideas, or system contributions that make the work stand out.

LoRA-Muon

spectral steepest descent

low-rank adaptation