🤖 AI Summary
This work addresses an inefficiency in zeroth-order (ZO) fine-tuning of large language models: because it is unclear how much each layer contributes, full-model ZO tuning incurs unnecessary computational overhead. The study identifies a single dominant decoder layer whose exclusive fine-tuning matches or even surpasses full-model ZO tuning. This dominant layer is determined by the model architecture rather than the task, and it can be identified in advance from anomalous activation patterns (activation outliers) observed during forward inference. Its high sensitivity to perturbation, combined with its early position in the residual stream, generates a strong optimization signal. Building on this insight, the authors propose an efficient single-layer ZO fine-tuning strategy that improves average performance over both full-model MeZO and LoRA-based ZO methods across nine benchmarks on LLaMA2-7B and Qwen3-8B, while achieving up to a 4.52× training speedup.
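To make the inference-only identification step concrete, the following is a minimal sketch, not the authors' procedure. It assumes per-layer hidden states have already been collected from a forward pass, and it flags the first layer whose peak activation magnitude is far larger than its median magnitude. The function name `first_outlier_layer`, the max-to-median criterion, and the threshold of 50 are all hypothetical choices made for illustration; the summary does not specify the exact outlier test.

```python
import numpy as np

def first_outlier_layer(hidden_states, ratio_threshold=50.0):
    """Return the index of the first layer whose peak activation magnitude
    exceeds `ratio_threshold` times its median magnitude, or None.

    The max-to-median ratio and the default threshold are illustrative
    assumptions, not the criterion used in the paper.
    """
    for idx, h in enumerate(hidden_states):
        magnitudes = np.abs(h)
        # Small constant guards against division by zero for all-zero inputs.
        ratio = magnitudes.max() / (np.median(magnitudes) + 1e-12)
        if ratio > ratio_threshold:
            return idx
    return None

# Synthetic stand-in for per-layer hidden states (tokens x hidden_dim).
rng = np.random.default_rng(0)
hidden_states = [rng.standard_normal((8, 16)) for _ in range(6)]

# Inject one very large activation into layers 3, 4, and 5 of the toy data.
for layer in (3, 4, 5):
    hidden_states[layer][0, 5] = 1000.0

dominant = first_outlier_layer(hidden_states)
print(dominant)
```

With a real model, the list of hidden states would come from a single forward pass over a few calibration inputs; how those inputs are chosen is left open here.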
📝 Abstract
Zeroth-order (ZO) optimization enables memory-efficient fine-tuning of large language models (LLMs) using only forward passes, but it remains unclear how useful adaptation is distributed across layers. In this work, we reveal a surprising phenomenon: ZO fine-tuning is sharply dominated by a single decoding layer. Across multiple LLM families and downstream tasks, fine-tuning this dominant layer alone consistently matches or even exceeds full-model ZO fine-tuning. We further show that the dominant layer is task-agnostic but model-specific, and can be identified before training through a simple inference-only analysis of activation outliers. Specifically, the dominant layer consistently aligns with the first activation-outlier layer in the pre-trained model. To explain this phenomenon, we analyze how perturbation effects propagate under ZO optimization. We find that the dominant layer combines two key properties: high perturbation sensitivity and early placement in the residual stream, which allows perturbation-induced effects to propagate and accumulate through the subsequent decoding layers. As a result, this layer produces disproportionately strong and stable optimization signals under forward-only updates. Extensive experiments on LLaMA2-7B and Qwen3-8B across nine benchmarks show that dominant-layer ZO fine-tuning improves average performance over full-model MeZO and LoRA-based ZO fine-tuning while achieving up to a 4.52× training speedup.
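As an illustration of what a forward-only update restricted to one layer can look like, here is a minimal sketch using a standard two-point (SPSA-style) zeroth-order estimator on a toy problem. It is not the authors' implementation: the two-"layer" additive model, the quadratic loss, the helper names `loss_fn` and `zo_step_single_layer`, and the values of `lr` and `eps` are all assumptions chosen to keep the example small and runnable.

```python
import numpy as np

def loss_fn(params, target):
    # Toy residual-style model: the output is the sum of two "layers".
    output = params["layer0"] + params["layer1"]
    return 0.5 * float(np.sum((output - target) ** 2))

def zo_step_single_layer(params, layer, target, lr=0.05, eps=1e-3, seed=0):
    """One two-point zeroth-order step that perturbs only params[layer].

    Uses two loss evaluations (forward passes) and no backpropagation.
    All other entries of `params` are left untouched.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(params[layer].shape)  # random direction, chosen layer only

    original = params[layer].copy()
    params[layer] = original + eps * z
    loss_plus = loss_fn(params, target)
    params[layer] = original - eps * z
    loss_minus = loss_fn(params, target)

    # Scalar estimate of the directional derivative along z.
    projected_grad = (loss_plus - loss_minus) / (2.0 * eps)
    params[layer] = original - lr * projected_grad * z
    return projected_grad

# Hypothetical setup: tune "layer0" only and keep "layer1" frozen.
params = {"layer0": np.zeros(4), "layer1": np.full(4, 0.5)}
target = np.array([1.0, -2.0, 3.0, 0.5])
frozen_before = params["layer1"].copy()

initial_loss = loss_fn(params, target)
for step in range(300):
    zo_step_single_layer(params, "layer0", target, seed=step)
final_loss = loss_fn(params, target)
print(initial_loss, final_loss)
```

Because only the chosen layer is perturbed and updated, the random direction `z` has the size of one layer rather than of the whole model. This sketch does not attempt to reproduce the reported speedup or to model an actual LLM.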