🤖 AI Summary
This work addresses the high computational cost of HQC decoding on mobile and embedded devices, which is dominated by the Reed–Muller and Reed–Solomon components. Targeting Qualcomm Hexagon processors in NPU-integrated devices, the study redesigns the dominant decoding kernels around the Hexagon Vector eXtensions (HVX) rather than a tensor-inference engine. Key contributions include HVX-friendly designs for the Reed–Muller Hadamard transform, scalar-equivalent peak selection, finite-field arithmetic, syndrome computation, and shortened-support locator-root evaluation, alongside optimized data layouts and execution patterns. Evaluated with the Hexagon simulator and on a Snapdragon 8 Gen 2 hardware development kit, the approach improves energy efficiency by up to 18.13×, substantially reducing latency and energy consumption while offloading work from the host CPU.
📝 Abstract
Hamming Quasi-Cyclic (HQC) has been selected by NIST for standardization as an additional code-based key-encapsulation mechanism, providing algorithmic diversity alongside lattice-based post-quantum cryptography. Efficient deployment of HQC on mobile and embedded platforms, however, requires careful optimization of its decoding procedure, whose Reed-Muller and Reed-Solomon components dominate the computational cost. This paper studies HQC decoding on Qualcomm Hexagon processors in NPU-integrated devices, focusing on the Hexagon Vector eXtensions (HVX) backend rather than a tensor-inference engine. We observe that HQC decoding naturally exposes vector-structured computation, including Reed-Muller reliability vectors, Hadamard-transform coefficients, Reed-Solomon syndrome vectors, finite-field products, and packed support-point evaluations. Based on this observation, we redesign the dominant decoding kernels around HVX-friendly data layouts and execution patterns, including a vectorized Reed-Muller Hadamard transform, scalar-equivalent peak selection, HVX-oriented finite-field arithmetic, vectorized syndrome computation, and shortened-support locator-root evaluation. We implement and evaluate the optimized decoder using both Hexagon simulator measurements and real-device experiments on a Snapdragon 8 Gen 2 hardware development kit. The results show that Hexagon/HVX-assisted decoding substantially reduces latency and energy consumption, improving energy efficiency by up to $18.13\times$ while significantly offloading host CPU work. These results indicate that NPU-integrated mobile platforms can serve as effective backends for structured post-quantum cryptographic decoding when the underlying kernels are reformulated around vector execution.
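To make the Reed-Muller part of the abstract concrete, the following is a minimal scalar sketch of how a reliability vector, a Hadamard transform, and peak selection fit together when decoding a first-order Reed-Muller code. It is not the paper's HVX code: the function names (`rm1_encode`, `hadamard_transform`, `find_peak`, `rm1_decode`), the message-bit layout, and the tie-breaking rule are assumptions made for illustration, and the summation of duplicated codewords used in HQC is omitted.

```python
def rm1_encode(msg, m=7):
    """Encode an (m+1)-bit message as a 2**m-bit first-order Reed-Muller codeword.

    Assumed layout (illustrative): the top bit is the constant term,
    the low m bits are the linear coefficients.
    """
    n = 1 << m
    const = (msg >> m) & 1
    lin = msg & (n - 1)
    # Bit j of the codeword is const XOR the parity of (lin AND j).
    return [const ^ (bin(lin & j).count("1") & 1) for j in range(n)]


def hadamard_transform(values):
    """Fast Walsh-Hadamard transform (butterfly form), returning a new list."""
    v = list(values)
    h = 1
    while h < len(v):
        for start in range(0, len(v), 2 * h):
            for j in range(start, start + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    return v


def find_peak(coeffs):
    """Return (index, value) of the coefficient with the largest magnitude.

    Ties keep the lowest index; this rule is an assumption of the sketch.
    """
    best = 0
    for i in range(1, len(coeffs)):
        if abs(coeffs[i]) > abs(coeffs[best]):
            best = i
    return best, coeffs[best]


def rm1_decode(bits, m=7):
    """Decode a noisy first-order Reed-Muller word by transform and peak search."""
    # Reliability vector: bit 0 maps to +1, bit 1 maps to -1.
    reliability = [1 - 2 * b for b in bits]
    coeffs = hadamard_transform(reliability)
    index, value = find_peak(coeffs)
    const = 1 if value < 0 else 0  # a negative peak means the constant term is set
    return (const << m) | index
```

For example, encoding the message `0b10110101`, flipping 20 of the 128 codeword bits, and calling `rm1_decode` recovers the original message, since the peak coefficient still dominates every other one. The butterfly loops operate on independent pairs within each stage, which is the kind of regular structure a vector unit can exploit.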