🤖 AI Summary
This work addresses the challenge of end-to-end training and the limited performance of Binary Neural Networks (BNNs) and Spiking Neural Networks (SNNs) in the absence of normalization layers. We propose a unified Bayesian modeling framework comprising: (1) an Importance-Weighted Straight-Through (IW-ST) estimator family, with a theoretical analysis of its bias–variance trade-off; (2) the first variational inference framework for Spiking BNNs, leveraging posterior noise for implicit regularization and low-bias gradient optimization; and (3) the elimination of handcrafted gradient approximations and of any dependence on normalization layers. Our method achieves state-of-the-art or competitive performance on the CIFAR-10, DVS Gesture, and SHD benchmarks, demonstrating for the first time stable end-to-end training of deep residual BNNs and SNNs without any normalization layers.
📝 Abstract
We propose a Bayesian framework for training binary and spiking neural networks that achieves state-of-the-art performance without normalisation layers. Unlike commonly used surrogate gradient methods, which are often heuristic and sensitive to hyperparameter choices, our approach is grounded in a probabilistic model of noisy binary networks, enabling fully end-to-end gradient-based optimisation. We introduce importance-weighted straight-through (IW-ST) estimators, a unified class generalising straight-through and relaxation-based estimators. We characterise the bias–variance trade-off in this family and derive a bias-minimising objective implemented via an auxiliary loss. Building on this, we introduce Spiking Bayesian Neural Networks (SBNNs), a variational inference framework that uses posterior noise to train binary and spiking neural networks with IW-ST. This Bayesian approach minimises gradient bias, regularises parameters, and introduces dropout-like noise. By linking low-bias conditions, vanishing gradients, and the KL term, we enable training of deep residual networks without normalisation. Experiments on CIFAR-10, DVS Gesture, and SHD show our method matches or exceeds existing approaches without normalisation or hand-tuned gradients.
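For background, the IW-ST family described above generalises the classic straight-through (ST) estimator: the forward pass applies a hard binarisation, while the backward pass treats it as (approximately) the identity. The sketch below shows that baseline ST estimator only — a minimal NumPy illustration with invented function names, not the paper's implementation; the hard-tanh clipping window on the backward pass is a common convention assumed here.

```python
import numpy as np

def binarize_forward(x):
    # Forward pass: hard binarisation to {-1, +1}.
    # This step has zero gradient almost everywhere,
    # which is why an estimator like ST is needed.
    return np.where(x >= 0, 1.0, -1.0)

def straight_through_grad(x, upstream_grad, clip=1.0):
    # Backward pass of the classic straight-through estimator:
    # pretend the binarisation was the identity, but only pass
    # gradients where |x| <= clip (the hard-tanh window), a
    # common heuristic to keep the estimator's bias in check.
    mask = (np.abs(x) <= clip).astype(x.dtype)
    return upstream_grad * mask

pre_activations = np.array([-2.0, -0.5, 0.3, 1.5])
spikes = binarize_forward(pre_activations)            # → [-1. -1.  1.  1.]
grads = straight_through_grad(pre_activations,
                              np.ones_like(pre_activations))
                                                      # → [0. 1. 1. 0.]
```

The IW-ST estimators of the paper refine this heuristic backward pass with importance weighting derived from the noisy-binary probabilistic model, trading bias against variance in a principled way rather than via the fixed clipping window shown here.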