🤖 AI Summary
This work addresses a subtle yet critical issue in data-parallel full-parameter fine-tuning: "silent inconsistency" among worker nodes, where divergent optimization dynamics prior to gradient aggregation remain undetected by conventional monitoring and can lead to training instability. To diagnose this phenomenon, the authors propose a lightweight, model-agnostic framework that quantifies inter-node optimization consistency using three metrics—loss dispersion, gradient-norm dispersion, and cross-node cosine similarity of gradient directions—derived solely from signals already available in standard training pipelines, at minimal overhead. Experiments on an 8-NPU setup fine-tuning a 1B-parameter model demonstrate that even when the globally averaged loss curve appears smooth, desynchronized data shuffling and unsynchronized random seeds can substantially exacerbate node-level inconsistency, validating both the effectiveness and the necessity of the proposed diagnostics.
📝 Abstract
Data-parallel (DP) training with synchronous all-reduce is a dominant paradigm for full-parameter fine-tuning of large language models (LLMs). While parameter synchronization guarantees numerical equivalence of model weights after each iteration, it does not necessarily imply alignment of worker-level optimization dynamics before gradient aggregation. This paper identifies and studies this latent mismatch, termed \emph{silent inconsistency}, where cross-worker divergence in losses and gradients can remain invisible under conventional aggregated monitoring signals. We propose a lightweight, model-agnostic diagnostic framework that quantifies worker-level consistency using training signals readily available in standard pipelines. Specifically, we introduce three complementary metrics: loss dispersion, gradient-norm dispersion, and gradient-direction consistency measured by inter-worker cosine similarity. The proposed metrics incur negligible overhead and require no modification to model architecture, synchronization mechanisms, or optimization algorithms. We validate the framework by fully fine-tuning the 1B-parameter \texttt{openPangu-Embedded-1B-V1.1} model on the \texttt{tatsu-lab/alpaca} dataset using an 8-NPU DP setup, under controlled perturbations of cross-rank stochasticity. Experimental results show that progressively desynchronized data shuffling and random seeds lead to substantial increases in loss and gradient-norm dispersion and reduced directional alignment, despite smooth globally averaged loss curves. These findings demonstrate that the proposed indicators provide actionable visibility into hidden instability modes in large-scale DP fine-tuning, enabling more reliable diagnosis and configuration assessment.
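The three metrics can be sketched concretely. The abstract does not specify the exact dispersion formula, so the sketch below assumes a coefficient-of-variation-style dispersion (standard deviation over mean) and mean pairwise cosine similarity for directional consistency; the function name and the NumPy-based, single-process formulation are illustrative assumptions, not the authors' implementation (which would gather per-rank signals across NPUs before all-reduce).

```python
import numpy as np

def consistency_metrics(losses, grads, eps=1e-12):
    """Hypothetical per-iteration worker-consistency metrics.

    losses: shape (W,)   -- per-worker scalar losses before aggregation
    grads:  shape (W, D) -- per-worker flattened gradients before all-reduce
    Returns (loss_dispersion, grad_norm_dispersion, direction_consistency).
    """
    losses = np.asarray(losses, dtype=np.float64)
    grads = np.asarray(grads, dtype=np.float64)

    # Loss dispersion: relative spread of per-worker losses.
    loss_disp = losses.std() / (abs(losses.mean()) + eps)

    # Gradient-norm dispersion: relative spread of per-worker gradient norms.
    norms = np.linalg.norm(grads, axis=1)
    norm_disp = norms.std() / (norms.mean() + eps)

    # Directional consistency: mean cosine similarity over all worker pairs.
    unit = grads / (norms[:, None] + eps)
    cos = unit @ unit.T
    iu = np.triu_indices(len(losses), k=1)  # upper triangle, i < j
    dir_consistency = cos[iu].mean()

    return loss_disp, norm_disp, dir_consistency
```

Under this reading, perfectly synchronized workers yield dispersions near 0 and directional consistency near 1, while desynchronized shuffling or seeds push the dispersions up and the cosine term down, even if the mean loss curve stays smooth.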