AI Summary
To address the excessive computational and memory overhead of deploying long-context Transformers, this paper proposes Hamming Attention Distillation (HAD), the first attention mechanism to incorporate Hamming distance: it binarizes keys and queries to {-1, +1}, replaces dot products with Hamming-distance-based similarity computation, and jointly applies attention matrix sparsification to preserve representational capacity under these stringent binary constraints. HAD enables hardware-software co-optimization for custom accelerators. Experiments show HAD incurs only a 1.78% accuracy drop on GLUE (7.3 percentage points better than prior state-of-the-art binarization methods) and a 2.5% drop on ImageNet (9.64 points better than SOTA). Hardware synthesis demonstrates a 79% reduction in area and 87% lower power consumption. The core innovation lies in the synergistic integration of Hamming-distance-driven attention distillation and high-fidelity binarization design.
Abstract
Pre-trained transformer models with extended context windows are notoriously expensive to run at scale, often limiting real-world deployment due to their high computational and memory requirements. In this paper, we introduce Hamming Attention Distillation (HAD), a novel framework that binarizes keys and queries in the attention mechanism to achieve significant efficiency gains. By converting keys and queries into {-1, +1} vectors and replacing dot-product operations with efficient Hamming distance computations, our method drastically reduces computational overhead. Additionally, we incorporate attention matrix sparsification to prune low-impact activations, which further reduces the cost of processing long-context sequences.

Despite these aggressive compression strategies, our distilled approach preserves a high degree of representational power, leading to substantially improved accuracy compared to prior transformer binarization methods. We evaluate HAD on a range of tasks and models, including the GLUE benchmark, ImageNet, and QuALITY, demonstrating state-of-the-art performance among binarized Transformers while drastically reducing the computational costs of long-context inference.

We implement HAD in custom hardware simulations, demonstrating superior performance characteristics compared to a custom hardware implementation of standard attention. HAD achieves just $\mathbf{1.78}\%$ performance loss on GLUE compared to $9.08\%$ in state-of-the-art binarization work, and $\mathbf{2.5}\%$ performance loss on ImageNet compared to $12.14\%$, all while targeting custom hardware with a $\mathbf{79}\%$ area reduction and $\mathbf{87}\%$ power reduction compared to its standard attention counterpart.
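To make the key idea concrete, below is a minimal NumPy sketch of Hamming-distance attention with sparsification, not the paper's actual implementation. It relies on the identity that for $q, k \in \{-1, +1\}^d$, $q^\top k = d - 2\,\mathrm{Hamming}(q, k)$, so Hamming distance can stand in for dot-product similarity. The sign-based binarizer, the `keep_ratio` parameter, and the top-k pruning rule are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def binarize(x):
    """Binarize real-valued vectors to {-1, +1} via the sign function
    (a common choice; the paper's exact binarizer may differ)."""
    return np.where(x >= 0, 1.0, -1.0)

def hamming_attention(Q, K, V, keep_ratio=0.5):
    """Sketch of Hamming-distance attention with top-k sparsification.

    For q, k in {-1, +1}^d, q . k = d - 2 * hamming(q, k), so the
    binarized dot product is recoverable from the Hamming distance
    (which hardware can compute with XOR + popcount).
    """
    d = Q.shape[-1]
    Qb, Kb = binarize(Q), binarize(K)
    # Hamming distance between +/-1 vectors: number of disagreeing coordinates.
    hamming = (d - Qb @ Kb.T) / 2                # shape (n_q, n_k)
    scores = (d - 2.0 * hamming) / np.sqrt(d)    # equals scaled Qb @ Kb.T
    # Sparsify: keep only the highest-scoring keys per query (illustrative rule).
    k_keep = max(1, int(keep_ratio * K.shape[0]))
    thresh = np.sort(scores, axis=-1)[:, -k_keep][:, None]
    scores = np.where(scores >= thresh, scores, -np.inf)
    # Softmax over the surviving scores; pruned entries get zero weight.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Toy usage: 4 queries, 6 keys/values, dimension 8.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
out = hamming_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Note that only keys and queries are binarized here, matching the abstract; values remain full precision, and the float matrix multiply stands in for the XOR/popcount datapath a custom accelerator would use.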