Information-Theoretic Greedy Layer-wise Training for Traffic Sign Recognition

📅 2025-10-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deep neural networks trained end-to-end with a global cross-entropy loss and backpropagation suffer from poor biological plausibility, high memory overhead, and gradient instability (e.g., vanishing or exploding gradients). To address these issues, this paper proposes a greedy, layer-wise training framework grounded in the information bottleneck (IB) principle, and is the first work to theoretically characterize CNN layer-wise convergence from an information-theoretic perspective. It introduces deterministic information bottleneck (DIB) objectives combined with the matrix-based Rényi α-order entropy functional to formulate per-layer optimization targets, eliminating the need for backpropagation; auxiliary classifiers are incorporated to stabilize representation learning. Evaluated on CIFAR-10, CIFAR-100, and traffic sign recognition benchmarks, the method achieves accuracy comparable to SGD-based end-to-end training and significantly outperforms existing layer-wise approaches. Moreover, it reduces memory consumption and mitigates gradient instability, offering both computational efficiency and neuroscientific interpretability.
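The summary's layer-wise objectives rest on the matrix-based Rényi α-order entropy functional, which estimates entropy and mutual information directly from the eigenvalues of a normalized kernel Gram matrix, with no density estimation. Below is a minimal NumPy sketch of that estimator; the Gaussian kernel, the width `sigma`, and `alpha = 1.01` are illustrative choices, not values taken from the paper.

```python
import numpy as np

def gram_matrix(x, sigma=1.0):
    """Unit-trace Gram matrix of a Gaussian kernel over the rows of x."""
    sq = np.sum(x ** 2, axis=1, keepdims=True)
    d2 = sq + sq.T - 2.0 * x @ x.T
    K = np.exp(-d2 / (2.0 * sigma ** 2))
    return K / np.trace(K)

def renyi_entropy(A, alpha=1.01):
    """Matrix-based Renyi entropy: S_a(A) = log2(sum_i lam_i(A)^a) / (1 - a)."""
    lam = np.clip(np.linalg.eigvalsh(A), 0.0, None)  # clip tiny negative eigenvalues
    return np.log2(np.sum(lam ** alpha)) / (1.0 - alpha)

def joint_entropy(A, B, alpha=1.01):
    """Joint entropy via the normalized Hadamard product of two Gram matrices."""
    H = A * B
    return renyi_entropy(H / np.trace(H), alpha)

def mutual_information(x, z, alpha=1.01, sigma=1.0):
    """I(X; Z) = S(A) + S(B) - S(A, B), estimated from paired samples."""
    A, B = gram_matrix(x, sigma), gram_matrix(z, sigma)
    return renyi_entropy(A, alpha) + renyi_entropy(B, alpha) - joint_entropy(A, B, alpha)
```

With n maximally distinct samples the normalized Gram matrix approaches I/n and the entropy approaches log2(n); with n identical samples it collapses to a rank-one matrix with entropy 0, which is the compression behavior a DIB-style objective exploits.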

📝 Abstract
Modern deep neural networks (DNNs) are typically trained with a global cross-entropy loss in a supervised end-to-end manner: neurons need to store their outgoing weights; training alternates between a forward pass (computation) and a top-down backward pass (learning) which is biologically implausible. Alternatively, greedy layer-wise training eliminates the need for cross-entropy loss and backpropagation. By avoiding the computation of intermediate gradients and the storage of intermediate outputs, it reduces memory usage and helps mitigate issues such as vanishing or exploding gradients. However, most existing layer-wise training approaches have been evaluated only on relatively small datasets with simple deep architectures. In this paper, we first systematically analyze the training dynamics of popular convolutional neural networks (CNNs) trained by stochastic gradient descent (SGD) through an information-theoretic lens. Our findings reveal that networks converge layer-by-layer from bottom to top and that the flow of information adheres to a Markov information bottleneck principle. Building on these observations, we propose a novel layer-wise training approach based on the recently developed deterministic information bottleneck (DIB) and the matrix-based Rényi's $\alpha$-order entropy functional. Specifically, each layer is trained jointly with an auxiliary classifier that connects directly to the output layer, enabling the learning of minimal sufficient task-relevant representations. We empirically validate the effectiveness of our training procedure on CIFAR-10 and CIFAR-100 using modern deep CNNs and further demonstrate its applicability to a practical task involving traffic sign recognition. Our approach not only outperforms existing layer-wise training baselines but also achieves performance comparable to SGD.
Problem

Research questions and friction points this paper is trying to address.

Developing biologically plausible layer-wise training without backpropagation
Analyzing CNN training dynamics through information-theoretic principles
Proposing greedy layer-wise training using deterministic information bottleneck
Innovation

Methods, ideas, or system contributions that make the work stand out.

Greedy layer-wise training eliminates backpropagation
Uses deterministic information bottleneck for representation learning
Each layer trained with auxiliary classifier for minimal representations
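The control flow behind these points, train one layer at a time against a local loss from its own auxiliary classifier, then freeze it, can be illustrated with a self-contained toy sketch. This is plain NumPy on synthetic XOR-like data with dense layers; the paper's CNNs, the DIB compression term, and the Rényi entropy estimator are omitted for brevity, and every name in the snippet is illustrative rather than from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(a):
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def train_layer(x_in, y, n_hidden, n_classes, lr=0.2, epochs=500):
    """Train one ReLU layer plus its auxiliary linear classifier with
    gradients that never propagate below this layer's input."""
    n, d = x_in.shape
    W = rng.normal(0.0, 0.5, (d, n_hidden))          # layer weights
    V = rng.normal(0.0, 0.5, (n_hidden, n_classes))  # auxiliary head
    onehot = np.eye(n_classes)[y]
    for _ in range(epochs):
        h = np.maximum(x_in @ W, 0.0)      # layer output
        p = softmax(h @ V)                 # auxiliary prediction
        g_logits = (p - onehot) / n        # dCE/dlogits
        g_h = (g_logits @ V.T) * (h > 0)   # gradient stops here: x_in is fixed
        V -= lr * h.T @ g_logits
        W -= lr * x_in.T @ g_h
    return W, V

# XOR-like toy data: not linearly separable, so a single linear map fails.
x = rng.normal(size=(200, 2))
y = (x[:, 0] * x[:, 1] > 0).astype(int)

# Greedy stack: train layer 1, freeze it, then train layer 2 on its output.
W1, _ = train_layer(x, y, 16, 2)
z1 = np.maximum(x @ W1, 0.0)
W2, V2 = train_layer(z1, y, 16, 2)
z2 = np.maximum(z1 @ W2, 0.0)
acc = float(np.mean(softmax(z2 @ V2).argmax(axis=1) == y))
```

Because each layer sees only its own local gradient, no intermediate activations or gradients for earlier layers need to be stored during training, which is where the memory savings claimed above come from.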
Shuyan Lyu
School of Software, Taiyuan University of Technology, Taiyuan, 100190, Shanxi, China.
Zhanzimo Wu
Department of Mathematics and Statistical Science, University College London, London, WC1E 6BT, United Kingdom.
Junliang Du
Shanghai Jiao Tong University