🤖 AI Summary
This work targets the memory-bandwidth bottleneck of deploying deep neural networks on edge devices, where the dense network does not fit on-chip and every parameter must be loaded from off-chip memory for each inference. Conventional compression reduces this traffic only by permanently sacrificing model capacity. The authors instead propose Sigma-Branch (SigmaB), a dynamic inference framework that restructures a pretrained dense network into a binary tree comprising a shared backbone, hierarchical routers, and specialized leaf nodes. Pretrained weights are assigned to branches by activation-based spherical k-means clustering, and soft-routing fine-tuning aligns each leaf with the inputs routed to it, so that only a single root-to-leaf path is executed per inference. This decouples per-inference memory traffic from the total parameter count while preserving the full parameter set. Evaluated on CIFAR-100 and ImageNet-1K (ResNet-50) and ModelNet40 (PointNet++), the method reduces active parameters by 58–60% while staying within 1.72 percentage points of the dense baseline's Top-1 accuracy. At comparable ImageNet-1K Top-1 accuracy, its active-parameter reduction exceeds that of static structured pruning baselines (FPGM, HRank) by 14–23 percentage points.
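To make the clustering step concrete, the following is a minimal sketch of spherical k-means (k-means on unit-normalized vectors with cosine similarity) for a two-way split. The summary does not specify exactly which vectors are clustered or how the centroids map to router weights, so the choices here are assumptions: the rows of the hypothetical array `acts` stand in for activation vectors, the farthest-first initialization is chosen only for determinism, and treating the centroids as initial router directions is illustrative rather than the authors' procedure.

```python
import numpy as np

def spherical_kmeans(x, k=2, iters=20):
    """Cluster the rows of x by cosine similarity (spherical k-means)."""
    # Project every row onto the unit sphere; the epsilon guards against zero rows.
    x = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-12)

    # Deterministic farthest-first initialization (an assumption of this sketch).
    centroids = [x[0]]
    for _ in range(1, k):
        sims = np.max(x @ np.stack(centroids).T, axis=1)
        centroids.append(x[np.argmin(sims)])
    centroids = np.stack(centroids)

    for _ in range(iters):
        # Assignment step: each row goes to the centroid with the highest cosine similarity.
        labels = np.argmax(x @ centroids.T, axis=1)
        # Update step: each centroid becomes the re-normalized sum of its members.
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:
                total = members.sum(axis=0)
                centroids[j] = total / (np.linalg.norm(total) + 1e-12)

    labels = np.argmax(x @ centroids.T, axis=1)
    return labels, centroids

# Hypothetical toy activations: two rows point mostly along each axis.
acts = np.array([[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.8]])
labels, centroids = spherical_kmeans(acts, k=2)
print(labels)     # cluster index per row
print(centroids)  # unit-norm directions, usable as stand-in router weights
```

Because both the data and the centroids lie on the unit sphere, the dot product in the assignment step is exactly the cosine similarity, which is what distinguishes this variant from Euclidean k-means.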
📝 Abstract
Deploying deep neural networks on memory-constrained edge accelerators is bottlenecked by per-inference off-chip weight transfer rather than computation: the dense network cannot be retained on-chip, and every parameter must be loaded for every input. Existing model compression reduces this transfer only at the cost of permanent capacity loss. We propose Sigma-Branch (SigmaB), a framework that restructures a pretrained dense network into a hierarchical binary tree composed of a shared backbone, hierarchical routers, and specialized leaves. Pretrained weights are distributed across the tree via activation-based spherical k-means clustering, which jointly initializes router weights and per-branch channel allocations; soft-routing fine-tuning then aligns each leaf with its routed input subset. At inference, the resulting network executes only a single root-to-leaf path, reducing the active-parameter footprint while storing the complete dense parameter set in memory. Across CIFAR-100 / ResNet-50, ImageNet-1K / ResNet-50, and ModelNet40 / PointNet++, SigmaB-Net reduces per-inference active parameters by 58-60% while remaining within 1.72 percentage points (pp) of the dense baseline Top-1. At comparable ImageNet-1K Top-1, the active-parameter reduction exceeds static structured pruning (FPGM, HRank) by 14-23 pp. The cross-modal evaluation, spanning 2D vision and 3D point-cloud backbones, substantiates a framework-level claim that decouples per-inference memory traffic from the total parameter count.
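The single-path execution described above can be illustrated with a small sketch. It is a toy model, not the SigmaB architecture: the `Node` class, the dense `DIM x DIM` layers, the sign-based hard router, and the values of `DIM` and `DEPTH` are all assumptions made for illustration. The point is only that the parameters touched by one input (the active footprint) are a fraction of the parameters stored in the whole tree.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, DEPTH = 8, 2  # hypothetical feature width and number of routing levels

class Node:
    """One tree node: a dense layer, plus a router and two children if internal."""
    def __init__(self, depth):
        self.weight = rng.standard_normal((DIM, DIM)) * 0.1
        self.is_leaf = depth == 0
        if not self.is_leaf:
            self.router = rng.standard_normal(DIM)
            self.children = (Node(depth - 1), Node(depth - 1))

def total_params(node):
    """Count every parameter stored in the subtree rooted at node."""
    count = node.weight.size
    if not node.is_leaf:
        count += node.router.size
        count += sum(total_params(child) for child in node.children)
    return count

def forward(node, x):
    """Run one root-to-leaf path; return the output and the parameters it touched."""
    h = np.maximum(node.weight @ x, 0.0)  # the root plays the shared backbone
    active = node.weight.size
    while not node.is_leaf:
        active += node.router.size
        go_right = int(node.router @ h > 0)  # hard binary routing decision
        node = node.children[go_right]
        h = np.maximum(node.weight @ h, 0.0)
        active += node.weight.size
    return h, active

root = Node(DEPTH)
output, active = forward(root, rng.standard_normal(DIM))
total = total_params(root)
print(f"active parameters: {active} of {total} stored")
```

In this toy tree the active count is the same for every input because all branches have equal size; only the identity of the loaded weights changes with the routing decisions. The ratio it prints reflects the toy dimensions and should not be compared with the 58-60% reduction reported for the actual models.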