Understanding the Training Speedup from Sampling with Approximate Losses

📅 2024-02-10
🏛️ International Conference on Machine Learning
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational overhead of sample selection in large-model training caused by precise loss computation, this paper proposes an efficient dynamic sampling method based on approximate loss estimation. Theoretically, it establishes for the first time the relationship between the approximation error in loss estimation and the number of iterations required for convergence. The authors design SIFT, a parameter-free sampling algorithm that integrates an early-exit mechanism to guarantee convergence while drastically reducing selection cost. SIFT leverages intermediate-layer representations to rapidly estimate per-sample losses, enabling greedy, on-the-fly sample selection. Experiments on BERT-base (110M) demonstrate that SIFT reaches 64% validation accuracy in ~43 hours, versus ~57 hours for standard training, while significantly reducing both the number of backward passes and total training time. Crucially, these gains are achieved without relying on hardware-specific or framework-level optimizations.

📝 Abstract
It is well known that selecting samples with large losses/gradients can significantly reduce the number of training steps. However, the selection overhead is often too high to yield any meaningful gains in terms of overall training time. In this work, we focus on the greedy approach of selecting samples with large approximate losses instead of exact losses in order to reduce the selection overhead. For smooth convex losses, we show that such a greedy strategy can converge to a constant factor of the minimum value of the average loss in fewer iterations than the standard approach of random selection. We also theoretically quantify the effect of the approximation level. We then develop SIFT, which uses early exiting to obtain approximate losses from an intermediate layer's representations for sample selection. We evaluate SIFT on the task of training a 110M parameter 12-layer BERT base model and show significant gains (in terms of training hours and number of backpropagation steps) without any optimized implementation over vanilla training. For example, to reach 64% validation accuracy, SIFT with exit at the first layer takes ~43 hours compared to ~57 hours of vanilla training.
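The core selection step described above can be sketched as a simple greedy top-k filter over cheap loss estimates. This is a minimal illustration, not the paper's implementation: `approx_loss_fn` stands in for SIFT's early-exit loss computed from an intermediate layer's representations, and the toy batch and proxy loss are invented for the example.

```python
import heapq

def select_topk_by_approx_loss(batch, approx_loss_fn, k):
    """Greedily keep the k samples with the largest approximate losses.

    approx_loss_fn is a cheap estimator (e.g., a loss computed from an
    early-exit head at an intermediate layer); only the selected samples
    would then receive a full forward/backward pass.
    """
    scored = [(approx_loss_fn(x), i) for i, x in enumerate(batch)]
    top = heapq.nlargest(k, scored)  # O(n log k) selection
    return [batch[i] for _, i in top]

# Toy usage: samples are scalars and |x| is a stand-in approximate loss.
batch = [-3.0, 0.5, 2.0, -0.1, 4.0, 1.0]
selected = select_topk_by_approx_loss(batch, lambda x: abs(x), k=2)
# selected holds the two samples with the largest proxy loss: 4.0 and -3.0
```

In the paper's setting the savings come from the estimator being much cheaper than an exact loss: the full forward pass (and the backward pass) is paid only for the selected subset.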
Problem

Research questions and friction points this paper is trying to address.

Reducing sample selection overhead in training
Using approximate losses for faster convergence
Improving training speed without relying on optimized implementations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses approximate losses for sample selection
Employs early exiting to reduce computation
Demonstrates faster convergence with SIFT