🤖 AI Summary
Deep neural networks trained with a global cross-entropy loss and backpropagation suffer from poor biological plausibility, high memory overhead, and gradient instability (e.g., vanishing/exploding gradients). To address these issues, the paper proposes a greedy, layer-wise training framework grounded in the information bottleneck (IB) principle, and is the first work to theoretically characterize CNN layer-wise convergence from an information-theoretic perspective. The authors formulate per-layer optimization targets from deterministic information bottleneck (DIB) objectives combined with the matrix-based Rényi α-order entropy, eliminating the need for backpropagation, and incorporate auxiliary classifiers to stabilize representation learning. Evaluated on CIFAR-10, CIFAR-100, and a traffic sign recognition benchmark, the method achieves accuracy comparable to SGD-based end-to-end training and significantly outperforms existing layer-wise approaches, while also reducing memory consumption and mitigating gradient instability, offering both computational efficiency and neuroscientific interpretability.
📝 Abstract
Modern deep neural networks (DNNs) are typically trained end-to-end under supervision with a global cross-entropy loss: every neuron must store its outgoing weights, and training alternates between a forward pass (computation) and a top-down backward pass (learning), which is biologically implausible. Greedy layer-wise training, by contrast, eliminates the need for a global cross-entropy loss and backpropagation. Because it avoids computing intermediate gradients and storing intermediate outputs, it reduces memory usage and helps mitigate vanishing or exploding gradients. However, most existing layer-wise training approaches have been evaluated only on relatively small datasets with simple deep architectures. In this paper, we first systematically analyze the training dynamics of popular convolutional neural networks (CNNs) trained by stochastic gradient descent (SGD) through an information-theoretic lens. Our findings reveal that networks converge layer by layer from bottom to top and that the flow of information adheres to a Markov information bottleneck principle. Building on these observations, we propose a novel layer-wise training approach based on the recently developed deterministic information bottleneck (DIB) and the matrix-based Rényi $\alpha$-order entropy functional. Specifically, each layer is trained jointly with an auxiliary classifier that connects it directly to the output layer, enabling the learning of minimal sufficient task-relevant representations. We empirically validate the effectiveness of our training procedure on CIFAR-10 and CIFAR-100 using modern deep CNNs, and further demonstrate its applicability to a practical traffic sign recognition task. Our approach not only outperforms existing layer-wise training baselines but also achieves performance comparable to SGD-based end-to-end training.
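For readers unfamiliar with the entropy functional the DIB objective builds on, the matrix-based Rényi $\alpha$-order entropy can be estimated directly from a kernel Gram matrix of the samples, with no explicit density estimation. The following NumPy sketch is illustrative only (it is not the authors' code; the Gaussian kernel and the `sigma` bandwidth are assumptions): entropy comes from the eigenvalues of the trace-normalized Gram matrix, joint entropy from the Hadamard product of two Gram matrices, and mutual information from their combination.

```python
import numpy as np

def gram_matrix(X, sigma=1.0):
    """Gaussian-kernel Gram matrix for samples stored in the rows of X."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma**2))

def renyi_entropy(K, alpha=2.0):
    """Matrix-based Renyi entropy S_alpha(A) = 1/(1-alpha) * log2(sum_i lambda_i(A)^alpha),
    where A is K normalized to unit trace (diagonal entries 1/n)."""
    n = K.shape[0]
    d = np.sqrt(np.diag(K))
    A = K / np.outer(d, d) / n          # unit-trace normalization
    eigs = np.linalg.eigvalsh(A)
    eigs = np.clip(eigs, 0.0, None)     # guard against tiny negative eigenvalues
    return (1.0 / (1.0 - alpha)) * np.log2(np.sum(eigs**alpha))

def joint_entropy(Kx, Ky, alpha=2.0):
    """Joint entropy via the Hadamard (elementwise) product of the Gram matrices."""
    return renyi_entropy(Kx * Ky, alpha)

def mutual_information(Kx, Ky, alpha=2.0):
    """I_alpha(X; Y) = S_alpha(X) + S_alpha(Y) - S_alpha(X, Y)."""
    return (renyi_entropy(Kx, alpha) + renyi_entropy(Ky, alpha)
            - joint_entropy(Kx, Ky, alpha))
```

Two sanity checks follow directly from the definition: identical samples give an all-ones Gram matrix whose normalized form has a single nonzero eigenvalue, so the entropy is 0; `n` mutually distant samples give a nearly diagonal Gram matrix and the maximal entropy `log2(n)`.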