AI Summary
This work investigates the applicability of the Lottery Ticket Hypothesis (LTH) to Bayesian Neural Networks (BNNs), which, despite their ability to quantify uncertainty effectively, suffer from high computational costs. The study presents the first empirical validation of the LTH in BNNs and introduces a novel joint pruning criterion that combines weight magnitude with standard deviation. Furthermore, it establishes a method for transferring subnetworks between Bayesian and deterministic "winning tickets." Experimental results demonstrate that high-accuracy sparse subnetworks exist within BNNs, matching or even surpassing the performance of the original dense networks across most sparsity levels, with only minor degradation at extreme sparsity. The findings also reveal varying sensitivity to mask structure and weight initialization. This work thus opens a new avenue toward efficient Bayesian inference through structured sparsity.
Abstract
Bayesian neural networks (BNNs) are a useful tool for uncertainty quantification, but require substantially more computational resources than conventional neural networks. For non-Bayesian networks, the Lottery Ticket Hypothesis (LTH) posits the existence of sparse subnetworks that can be trained to match or even surpass the accuracy of the original dense network. Such sparse networks lower the demand for computational resources both at inference and during training. Establishing the LTH and corresponding sparse subnetworks in BNNs could motivate the development of sparse training algorithms and provide valuable insights into the underlying training process. Towards this end, we translate the LTH experiments to a Bayesian setting using common computer vision models. We investigate the defining characteristics of Bayesian lottery tickets and extend our study with a transplantation method connecting BNNs to deterministic lottery tickets. We generally find that the LTH holds in BNNs: winning tickets of matching and surpassing accuracy are present independent of model size, with degradation only at very high sparsities. However, the pruning strategy should rely primarily on weight magnitude and only secondarily on standard deviation. Furthermore, our results demonstrate that models rely on mask structure and weight initialization to varying degrees.
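The pruning finding above (rely primarily on magnitude, secondarily on standard deviation) can be illustrated as a scoring rule over a mean-field posterior, where each weight has a mean `mu` and a standard deviation `sigma`. The combined score, the secondary weighting `lam`, and the function name below are illustrative assumptions for this sketch, not the paper's exact criterion:

```python
import numpy as np

def joint_prune_mask(mu, sigma, sparsity, lam=0.1):
    """Return a boolean keep-mask at the given sparsity level.

    Sketch of a joint criterion: the score is dominated by |mu|
    (magnitude pruning), with the posterior std sigma as a secondary,
    down-weighted penalty so that more uncertain weights score lower.
    """
    score = np.abs(mu) - lam * sigma
    k = int(round(sparsity * mu.size))  # number of weights to prune
    if k == 0:
        return np.ones_like(mu, dtype=bool)
    # k-th smallest score acts as the pruning threshold
    threshold = np.partition(score.ravel(), k)[k]
    return score >= threshold

# Example: prune 90% of a layer's weights
rng = np.random.default_rng(0)
mu = rng.normal(size=(64, 64))
sigma = np.abs(rng.normal(scale=0.5, size=(64, 64)))
mask = joint_prune_mask(mu, sigma, sparsity=0.9)
```

Setting `lam=0` recovers plain magnitude pruning; a large `lam` would instead prune chiefly by uncertainty, which the abstract suggests is the weaker criterion.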