Joint or Disjoint: Mixing Training Regimes for Early-Exit Models

📅 2024-07-19
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing early-exit model training paradigms—joint and separate—lack theoretical grounding and systematic empirical evaluation. Method: We formalize training mechanisms into three standardized categories: joint, separate, and hybrid; and propose a staged training strategy: first independently training the backbone, then jointly optimizing the backbone and exit classifiers. We conduct rigorous analysis via information bottleneck theory, loss curvature modeling, numerical rank estimation of activation matrices, and extensive experiments across multiple architectures (ResNet, ViT) and datasets (CIFAR, ImageNet). Contribution/Results: Our study uncovers principled patterns governing how training paradigms interact with model architecture and data characteristics. The hybrid paradigm achieves superior accuracy–latency trade-offs: up to 18% inference speedup on ImageNet with <0.3% top-1 accuracy degradation. This work establishes an interpretable, reproducible foundation for training multi-exit models, bridging theoretical insight with practical performance gains.
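The staged ("hybrid") regime described above can be sketched as a simple training schedule: gradients flow to the backbone alone during a warm-up phase, then to the backbone and all exit classifiers jointly, with the joint phase minimizing a weighted sum of per-exit losses. This is a minimal, framework-agnostic sketch; the function names, the `warmup_epochs` parameter, and uniform default exit weights are illustrative assumptions, not the paper's exact hyperparameters.

```python
def hybrid_schedule(epoch, warmup_epochs=50):
    """Which parameter groups receive gradients at a given epoch under a
    staged (hybrid) regime: backbone-only first, then joint optimization
    of backbone and exit heads. `warmup_epochs` is an assumed value."""
    if epoch < warmup_epochs:
        return {"backbone": True, "exit_heads": False}
    return {"backbone": True, "exit_heads": True}


def multi_exit_loss(exit_losses, weights=None):
    """Joint-phase objective: a weighted sum of the per-exit losses.
    Uniform weights are an illustrative default."""
    if weights is None:
        weights = [1.0] * len(exit_losses)
    return sum(w * l for w, l in zip(weights, exit_losses))
```

In practice the two phases would typically be realized via optimizer parameter groups or by toggling `requires_grad` on the exit heads; the schedule above only makes the staging explicit.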

📝 Abstract
Early exits are an important efficiency mechanism integrated into deep neural networks that allows the forward pass to terminate before processing through all layers. By halting inference early for less complex inputs that reach high confidence, early exits significantly reduce the amount of computation required. Early-exit methods add trainable internal classifiers, which complicates the training process. However, there is no consistent evaluation of how early-exit models are trained, and no unified scheme for training such models. Most early-exit methods employ a training strategy that either trains the backbone network and the exit heads simultaneously or trains the exit heads separately. We propose a training approach in which the backbone is first trained on its own, followed by a phase in which the backbone and the exit heads are trained together. Accordingly, we organize early-exit training strategies into three distinct categories and validate their performance and efficiency. In this benchmark, we perform both theoretical and empirical analysis of early-exit training regimes. We study the methods in terms of information flow, loss landscape, and numerical rank of activations, and gauge the suitability of each regime for various architectures and datasets.
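The confidence-based early halting the abstract describes can be sketched as follows: the input flows through the backbone block by block, an internal classifier scores the intermediate representation after each block, and inference stops as soon as the maximum softmax probability clears a threshold. This is a minimal sketch under assumed interfaces (`blocks` and `exit_heads` as parallel lists of callables, a single scalar `threshold`); real early-exit models vary in how confidence is measured and where exits are placed.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(x, blocks, exit_heads, threshold=0.9):
    """Run backbone blocks sequentially; after each block, the matching
    exit head produces logits. Halt as soon as the top softmax
    probability reaches `threshold`. Returns (predicted_class, exit_index)."""
    h = x
    for i, (block, head) in enumerate(zip(blocks, exit_heads)):
        h = block(h)
        probs = softmax(head(h))
        confidence = max(probs)
        if confidence >= threshold:
            return probs.index(confidence), i
    # No exit was confident enough: fall through to the final classifier.
    return probs.index(confidence), len(blocks) - 1
```

For example, with a first exit head that is maximally uncertain and a second that is confident, a "hard" input passes the first exit and stops at the second, while lowering the threshold makes the same input exit earlier at the cost of a less confident prediction.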
Problem

Research questions and friction points this paper is trying to address.

Analyzing training strategy impacts on multi-exit network performance
Comparing joint versus disjoint early-exit training approaches
Proposing mixed training strategy for improved efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposed mixed training strategy for multi-exit models
Backbone trained first, then the entire network optimized jointly
Comprehensive evaluation across architectures and datasets