DECA: Decentralizing Block-Wise Adam for Efficient LLM Full-Parameter Fine-Tuning on Non-IID Data

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses full-parameter fine-tuning of large language models in privacy-sensitive, resource-constrained decentralized settings, where non-IID client data induces client drift and unstable convergence. The authors propose DECA, a framework that partitions model parameters into disjoint blocks and applies sequential block-wise Adam optimization. DECA maintains first- and second-order moment estimates at the block level and stabilizes training by combining fresh local gradient statistics with consensus-derived discrepancy signals. According to the authors' theoretical analysis and experiments, DECA converges quickly, achieves strong downstream performance, and reduces resource consumption while preserving full-parameter adaptation in decentralized training.
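To give a rough sense of why block-wise optimization can lower memory use, the sketch below compares the size of Adam's two moment buffers for a whole model against the size needed for one block at a time. It assumes that only the active block's moment buffers are held in memory, which the summary does not state explicitly. The model size (7 billion parameters), the block count (32), and the 4-byte values are hypothetical figures chosen for illustration, not numbers from the paper.

```python
def adam_state_bytes(num_params: int, bytes_per_value: int = 4) -> int:
    """Bytes for Adam's two moment buffers (m and v): one value each per parameter."""
    return 2 * num_params * bytes_per_value


def blockwise_adam_state_bytes(block_sizes, bytes_per_value: int = 4) -> int:
    """Peak moment-buffer bytes, ASSUMING only the active block keeps Adam state."""
    return adam_state_bytes(max(block_sizes), bytes_per_value)


# Hypothetical configuration for illustration only.
total_params = 7_000_000_000
num_blocks = 32
block_sizes = [total_params // num_blocks] * num_blocks

full_bytes = adam_state_bytes(total_params)
block_bytes = blockwise_adam_state_bytes(block_sizes)

print(f"full Adam state:       {full_bytes / 1e9:.2f} GB")
print(f"block-wise Adam state: {block_bytes / 1e9:.2f} GB")
```

Under these assumptions the optimizer state shrinks in proportion to the number of equally sized blocks. Weights, gradients, and activations are not counted here.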

📝 Abstract

Fine-tuning large language models (LLMs) in privacy-sensitive and resource-constrained environments remains challenging. Since training data are often distributed across multiple clients, decentralized fine-tuning offers a natural paradigm for collaborative adaptation without a central server. However, enabling full-parameter fine-tuning (FPFT) in this decentralized setting is difficult: FPFT provides strong adaptation capacity but incurs prohibitive resource consumption for billion-scale models. Existing decentralized LLM fine-tuning methods therefore mainly rely on parameter-efficient updates, which improve efficiency but may restrict downstream performance. Moreover, client data are typically non-IID, making decentralized optimization more vulnerable to client drift and unstable convergence. To address these challenges, we propose DECA, a resource-efficient decentralized FPFT framework for LLMs on non-IID data. DECA partitions model parameters into disjoint blocks and performs sequential block-wise Adam optimization, reducing resource consumption while preserving decentralized full-parameter adaptation. To stabilize training, DECA further introduces first- and second-order block-wise moment estimates with fresh local gradient statistics and consensus-derived discrepancy signals. We provide rigorous theoretical analysis and extensive experiments, showing that DECA achieves fast convergence, strong downstream performance, and significant resource efficiency.
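The general idea of sequential block-wise Adam in a serverless setting can be illustrated with a toy example. The sketch below is not DECA's update rule: it omits the consensus-derived discrepancy signals, whose form the abstract does not give, and it swaps in plain neighbor averaging on a ring as the communication step. The function names, the quadratic per-client loss, the ring topology, and all hyperparameters are assumptions made for illustration.

```python
import numpy as np


def ring_mixing_matrix(n: int) -> np.ndarray:
    """Doubly stochastic matrix: each client averages itself and its two ring neighbors."""
    W = np.zeros((n, n))
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i, j % n] += 1.0 / 3.0
    return W


def total_loss(x: np.ndarray, targets: np.ndarray) -> float:
    """Sum over clients of 0.5 * ||x_i - target_i||^2 (toy non-IID objective)."""
    return 0.5 * float(np.sum((x - targets) ** 2))


def blockwise_decentralized_adam(targets, blocks, steps_per_block=200, epochs=3,
                                 lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """Toy sketch: visit blocks in sequence; each client runs Adam on the active
    block only, then averages that block with its ring neighbors."""
    n, d = targets.shape
    W = ring_mixing_matrix(n)
    x = np.zeros((n, d))  # row i holds client i's copy of the parameters
    for _ in range(epochs):
        for idx in blocks:
            # Moment buffers exist only for the active block (assumption).
            m = np.zeros((n, len(idx)))
            v = np.zeros((n, len(idx)))
            for t in range(1, steps_per_block + 1):
                g = x[:, idx] - targets[:, idx]          # local gradient on the block
                m = beta1 * m + (1 - beta1) * g          # first moment
                v = beta2 * v + (1 - beta2) * g ** 2     # second moment
                m_hat = m / (1 - beta1 ** t)             # bias correction
                v_hat = v / (1 - beta2 ** t)
                x[:, idx] = x[:, idx] - lr * m_hat / (np.sqrt(v_hat) + eps)
                x[:, idx] = W @ x[:, idx]                # gossip step with neighbors
    return x


# Usage: 4 clients with different targets (a stand-in for non-IID data), 3 blocks.
rng = np.random.default_rng(0)
targets = 5.0 + rng.normal(scale=0.5, size=(4, 12))
blocks = np.array_split(np.arange(12), 3)
x_final = blockwise_decentralized_adam(targets, blocks)
print("loss at start:", total_loss(np.zeros_like(targets), targets))
print("loss at end:  ", total_loss(x_final, targets))
```

Resetting the moment buffers each time a block becomes active keeps the optimizer state proportional to one block, at the cost of discarding moment history between visits. Whether DECA resets or carries over block-level moments is not specified in the abstract.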

Problem

Research questions and friction points this paper is trying to address.

decentralized fine-tuning

full-parameter fine-tuning

non-IID data

large language models

client drift

Innovation

Methods, ideas, or system contributions that make the work stand out.

decentralized fine-tuning

block-wise Adam

full-parameter fine-tuning