Silent Inconsistency in Data-Parallel Full Fine-Tuning: Diagnosing Worker-Level Optimization Misalignment

📅 2026-02-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a subtle yet critical issue in data-parallel full-parameter fine-tuning: the presence of “silent inconsistency” among worker nodes, where divergent optimization dynamics prior to gradient aggregation remain undetected by conventional monitoring, leading to training instability. To diagnose this phenomenon, the authors propose a lightweight, model-agnostic framework that quantifies inter-node optimization consistency using three metrics—loss dispersion, gradient-norm dispersion, and cross-node cosine similarity of gradient directions—derived solely from signals available in standard training pipelines with minimal overhead. Experiments on an 8-NPU setup fine-tuning a 1B-parameter model demonstrate that even when the globally averaged loss curve appears smooth, desynchronized data shuffling and unsynchronized random seeds can significantly exacerbate node-level inconsistencies, validating both the effectiveness and the necessity of the proposed diagnostic approach.

📝 Abstract
Data-parallel (DP) training with synchronous all-reduce is a dominant paradigm for full-parameter fine-tuning of large language models (LLMs). While parameter synchronization guarantees numerical equivalence of model weights after each iteration, it does not necessarily imply alignment of worker-level optimization dynamics before gradient aggregation. This paper identifies and studies this latent mismatch, termed \emph{silent inconsistency}, where cross-worker divergence in losses and gradients can remain invisible under conventional aggregated monitoring signals. We propose a lightweight, model-agnostic diagnostic framework that quantifies worker-level consistency using training signals readily available in standard pipelines. Specifically, we introduce three complementary metrics: loss dispersion, gradient-norm dispersion, and gradient-direction consistency measured by inter-worker cosine similarity. The proposed metrics incur negligible overhead and require no modification to model architecture, synchronization mechanisms, or optimization algorithms. We validate the framework by fully fine-tuning the 1B-parameter \texttt{openPangu-Embedded-1B-V1.1} model on the \texttt{tatsu-lab/alpaca} dataset using an 8-NPU DP setup, under controlled perturbations of cross-rank stochasticity. Experimental results show that progressively desynchronized data shuffling and random seeds lead to substantial increases in loss/gradient dispersion and reduced directional alignment, despite smooth globally averaged loss curves. These findings demonstrate that the proposed indicators provide actionable visibility into hidden instability modes in large-scale DP fine-tuning, enabling more reliable diagnosis and configuration assessment.
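The three metrics named in the abstract can be computed from per-worker losses and flattened pre-aggregation gradients gathered each iteration (e.g. via an all-gather). The sketch below is a minimal illustration of that idea, not the authors' implementation: the function name, the use of the coefficient of variation for the two dispersion metrics, and the mean pairwise cosine similarity for directional consistency are assumptions about how such indicators might be realized.

```python
import numpy as np

def consistency_metrics(losses, grads, eps=1e-12):
    """Illustrative worker-level consistency diagnostics for one iteration.

    losses: shape (W,)    per-worker mini-batch losses before aggregation
    grads:  shape (W, D)  per-worker flattened gradients before all-reduce
    Returns (loss_dispersion, grad_norm_dispersion, direction_consistency).
    """
    losses = np.asarray(losses, dtype=np.float64)
    grads = np.asarray(grads, dtype=np.float64)

    # Loss dispersion: spread of per-worker losses relative to their mean
    # (coefficient of variation; an assumed normalization choice).
    loss_disp = losses.std() / (abs(losses.mean()) + eps)

    # Gradient-norm dispersion: same statistic over per-worker gradient norms.
    norms = np.linalg.norm(grads, axis=1)
    norm_disp = norms.std() / (norms.mean() + eps)

    # Direction consistency: mean cosine similarity over all worker pairs;
    # 1.0 means all workers push the parameters in the same direction.
    unit = grads / (norms[:, None] + eps)
    cos = unit @ unit.T
    i, j = np.triu_indices(len(losses), k=1)
    dir_consistency = cos[i, j].mean()

    return loss_disp, norm_disp, dir_consistency
```

Under this sketch, perfectly aligned workers yield zero dispersion and a direction consistency of 1.0, while desynchronized shuffling or seeds would push the dispersions up and the cosine similarity down, even if the averaged loss curve stays smooth.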
Problem

Research questions and friction points this paper is trying to address.

silent inconsistency
data-parallel training
optimization misalignment
worker-level divergence
large language model fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

silent inconsistency
data-parallel training
worker-level optimization
gradient consistency
diagnostic framework
🔎 Similar Papers
No similar papers found.
Hong Li
School of Transportation, Southeast University
Zhen Zhou
Southeast University
Urban Computing · Machine Learning
Honggang Zhang
Department of Logistics and Maritime Studies, The Hong Kong Polytechnic University
Yuping Luo
School of Transportation, Southeast University
Xinyue Wang
School of Transportation, Southeast University
Han Gong
Apple Inc.
Color · Color Imaging · Computer Vision · Image Processing
Zhiyuan Liu
School of Transportation, Southeast University