PETRA: Parallel End-to-end Training with Reversible Architectures

📅 2024-06-04

🏛️ International Conference on Learning Representations

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Tight coupling between the forward and backward passes limits parallelism in deep model training and forces redundant GPU memory use. This paper proposes PETRA, described as the first end-to-end, stage-level parallel training framework built on reversible architectures. Its core contributions are: (1) a model-parallel paradigm that needs no weight stashing and decouples forward and backward computation via reversible neural networks, eliminating intermediate activation storage; (2) systematic integration of reversibility into the gradient computation pipeline, so that each stage trains independently on its own device; and (3) a lightweight, autograd-like custom training engine that adapts mainstream architectures (e.g., ResNet). Experiments on CIFAR-10, ImageNet32, and ImageNet show that PETRA matches standard backpropagation accuracy while reducing GPU memory consumption by 37%–52% and improving multi-GPU training throughput by 1.8×–2.3×.

📝 Abstract

Reversible architectures have been shown to perform on par with their non-reversible counterparts and have been applied in deep learning for memory savings and generative modeling. In this work, we show how reversible architectures can solve challenges in parallelizing deep model training. We introduce PETRA, a novel alternative to backpropagation for parallelizing gradient computations. PETRA facilitates effective model parallelism by enabling stages (i.e., sets of layers) to compute independently on different devices, while only needing to communicate activations and gradients between each other. By decoupling the forward and backward passes and keeping a single updated version of the parameters, the need for weight stashing is also removed. We develop a custom autograd-like training framework for PETRA, and we demonstrate its effectiveness on CIFAR-10, ImageNet32, and ImageNet, achieving competitive accuracies comparable to backpropagation using ResNet-18, ResNet-34, and ResNet-50 models.
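The abstract relies on the defining property of a reversible block: its input can be recomputed exactly from its output, so activations do not have to be stored for the backward pass. The following is a minimal sketch of one common construction, an additive coupling block, and is not taken from the paper. The functions `f` and `g`, the list-based vectors, and the sample values are all illustrative assumptions.

```python
import math

def f(v):
    # Hypothetical residual function F; it does not need to be invertible.
    return [math.tanh(a) for a in v]

def g(v):
    # Hypothetical residual function G; also arbitrary.
    return [0.5 * a * a for a in v]

def forward(x1, x2):
    # Additive coupling: each half is updated using only the other half.
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2

def inverse(y1, y2):
    # Undo the two updates in reverse order to recover the input.
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2

x1, x2 = [0.3, -1.2], [2.0, 0.7]
y1, y2 = forward(x1, x2)
r1, r2 = inverse(y1, y2)

# Largest reconstruction error over all coordinates (floating-point rounding only).
max_error = max(abs(a - b) for a, b in zip(x1 + x2, r1 + r2))
```

Because `inverse` needs only the block output, a stage built from such blocks can discard its input after the forward pass and rebuild it when gradients arrive.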

Problem

Research questions and friction points this paper is trying to address.

Parallelizing deep model training with reversible architectures

Eliminating weight stashing through decoupled forward-backward passes

Enabling independent stage computation across multiple devices

Innovation

Methods, ideas, or system contributions that make the work stand out.

Parallel gradient computation via reversible architectures

Decoupled forward-backward passes eliminating weight stashing

Custom autograd-like framework enabling independent stage computation
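To make the points above concrete, here is a toy sketch of a single stage that caches nothing between its forward and backward calls: it rebuilds its input from its output using whatever weight it currently holds, then updates that one copy. This only illustrates the general idea and is not the paper's algorithm or its framework. The class `ReversibleStage`, the scalar weight `w`, and the learning rate are hypothetical.

```python
class ReversibleStage:
    """Toy stage: y1 = x1 + w * x2, y2 = x2, with a single scalar weight w."""

    def __init__(self, w, lr=0.1):
        self.w = w
        self.lr = lr

    def forward(self, x1, x2):
        # Nothing is cached here: no activations and no copy of w.
        return x1 + self.w * x2, x2

    def backward(self, y1, y2, gy1, gy2):
        # Rebuild the input from the output using the current weight.
        x2 = y2
        x1 = y1 - self.w * x2
        # Chain rule for y1 = x1 + w * x2, y2 = x2.
        gw = gy1 * x2
        gx1 = gy1
        gx2 = gy2 + gy1 * self.w
        # Update the single live copy of the weight.
        self.w -= self.lr * gw
        # The rebuilt input and its gradients go to the previous stage.
        return (x1, x2), (gx1, gx2)

stage = ReversibleStage(w=0.5)
y1, y2 = stage.forward(1.0, 2.0)
(x1, x2), (gx1, gx2) = stage.backward(y1, y2, gy1=1.0, gy2=0.0)
```

In this toy, if `w` were updated by another micro-batch between `forward` and `backward`, the rebuilt `x1` would only approximate the original input. That is the price of keeping a single weight copy instead of stashing one per in-flight batch.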
