PC Layer: Polynomial Weight Preconditioning for Improving LLM Pre-Training

📅 2026-06-04
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses unstable condition numbers of weight matrices during large language model pre-training, which can make optimization difficult and cause training oscillations. The authors propose a polynomial preconditioning (PC) layer that reshapes the singular value spectrum of each weight matrix with a low-degree polynomial, keeping the conditioning stable throughout training. After training, the preconditioned weights can be merged back into the original architecture, so the method adds no inference overhead. On the theory side, the authors prove that, for certain deep linear networks, uniformly bounding each layer's singular values ensures geometric convergence of gradient descent to global minima. In Llama-1B pre-training, the PC layer shows an advantage over standard Transformers with both the AdamW and Muon optimizers.
📝 Abstract
We propose a preconditioning (PC) layer, a weight parameterization via a polynomial preconditioner that ensures stable weight conditioning throughout LLM training. The PC module reshapes the singular-value spectrum of weight matrices via low-degree polynomial preconditioning. After training, the preconditioned weights can be merged back into the original architecture, incurring no inference overhead. We demonstrate the advantage of the proposed PC layer over standard transformers in Llama-1B pre-training, for both the AdamW and Muon optimizers. Theoretically, we justify this spectrum-control principle by proving that uniformly bounding each layer's singular values ensures geometric convergence of gradient descent to global minima, for certain deep linear networks. Our code is available at https://github.com/Empath-aln/PC-layer.
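To make the spectrum-reshaping idea concrete, here is a minimal NumPy sketch. It is not the authors' implementation, and the abstract does not state which polynomial the paper uses. The sketch assumes an odd polynomial of the form sum_k c_k (W W^T)^k W, which for W = U diag(s) V^T equals U diag(p(s)) V^T with p(s) = sum_k c_k s^(2k+1), so the singular vectors are kept and only the singular values change. The function names and the coefficients (1.5, -0.5) are hypothetical choices for illustration.

```python
import numpy as np

def odd_poly_precondition(W, coeffs):
    """Return sum_k coeffs[k] * (W W^T)^k W.

    For W = U diag(s) V^T this equals U diag(p(s)) V^T with
    p(s) = sum_k coeffs[k] * s**(2k+1): only singular values change.
    """
    gram = W @ W.T
    term = W.copy()            # (W W^T)^0 W
    out = coeffs[0] * term
    for c in coeffs[1:]:
        term = gram @ term     # raise the power of (W W^T) by one
        out = out + c * term
    return out

def condition_number(W):
    """Ratio of the largest to the smallest singular value."""
    s = np.linalg.svd(W, compute_uv=False)
    return s.max() / s.min()

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))
W = W / np.linalg.norm(W, 2)   # scale so the largest singular value is 1

# Hypothetical coefficients (an assumption, not taken from the paper):
# p(s) = 1.5*s - 0.5*s**3, which lifts small singular values toward 1.
coeffs = (1.5, -0.5)
W_pc = odd_poly_precondition(W, coeffs)

print("condition number before:", condition_number(W))
print("condition number after: ", condition_number(W_pc))

# "Merging": applying the polynomial on the fly gives the same output as
# multiplying by the precomputed dense matrix W_pc.
x = rng.standard_normal(32)
y_on_the_fly = coeffs[0] * (W @ x) + coeffs[1] * (W @ (W.T @ (W @ x)))
y_merged = W_pc @ x
print("merged output matches:", np.allclose(y_on_the_fly, y_merged))
```

Because p(s) > s for 0 < s < 1 and p(1) = 1, this particular cubic shrinks the ratio between the largest and smallest singular values. The last lines illustrate why such a parameterization can be merged after training: the effective weight is an ordinary dense matrix of the same shape as W, so it can be stored in place of W.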
Problem

Research questions and friction points this paper is trying to address.

preconditioning
singular values
LLM pre-training
weight conditioning
training stability
Innovation

Methods, ideas, or system contributions that make the work stand out.

polynomial preconditioning
singular value spectrum
weight conditioning
LLM pre-training
geometric convergence