Higher-Order Token Interactions via Quantum Attention

📅 2026-06-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Within a single layer, standard self-attention models only pairwise (second-order) token interactions and struggles to capture higher-order dependencies efficiently. This work proposes Quantum Higher-Order Attention (QHA), a shallow quantum attention head that combines data re-uploading, an all-to-all non-Clifford entangler, and a local single-qubit read-out to realize order-$k$ token interactions inside the circuit, together with a classical–quantum hybrid training strategy. Theoretically, one QHA head is proven to represent an order-$k$ correlation family that no single standard self-attention layer can represent under a bound on embedding dimension, head count, and precision. The local-design instantiation is also proven free of barren plateaus, whereas the more expressive all-to-all instantiation used in the benchmarks is trained empirically and shows exponentially decaying gradients. Empirically, with a 6.5× smaller parameter budget than the classical attention baseline, QHA generalizes hidden-subset parity of every order up to 6, while the classical head collapses past order 2. As a compact high-order interaction detector, QHA reaches the noise ceiling at the smallest parameter budget on genetic epistasis detection, learning parity with noise, and graph triangle detection, where field-standard linear methods fail.
📝 Abstract
Standard dot-product self-attention computes, in a single layer, only pairwise (order-2) interactions between tokens; representing a generic order-$k$ interaction is known to require either super-quadratic resources in one layer or composition across depth. We introduce Quantum Higher-Order Attention (QHA), a shallow, hardware-realizable quantum attention head that, via data re-uploading and an all-to-all non-Clifford entangler, synthesizes order-$k$ token interactions inside the circuit and exposes them through a local single-qubit read-out. We prove (i) an expressivity separation: any single standard self-attention layer with embedding dimension $m$, $H$ heads and $p$-bit precision satisfying $mHp=o(N/\log\log N)$ cannot represent the order-$k$ correlation family that one QHA head represents with circuit depth $O(\log k)$ ($O(k)$ two-qubit gates); and (ii) a trainability guarantee for its local-design instantiation: with a local read-out and $O(\log n)$ depth the gradient variance is $\Omega(1/\mathrm{poly}(n))$ (no barren plateau), which we confirm empirically, while being explicit that the more expressive all-to-all instantiation we benchmark is trained empirically and shows exponentially decaying gradients. Empirically, at a $6.5\times$ smaller parameter budget, QHA generalizes hidden-subset parity of every order $k\le6$ from disjoint inputs, whereas the larger classical attention head collapses past order 2; consistent with theory, the size of the advantage tracks the target's Fourier degree: largest for parity and shrinking when low-order structure is present. As an application, QHA serves as a compact high-order interaction detector across three domains (genetic epistasis, learning-parity-with-noise, and graph triangle detection), reaching the noise ceiling at the smallest parameter budget where field-standard linear methods fail.
Problem

Research questions and friction points this paper is trying to address.

higher-order interactions
self-attention
quantum attention
token interactions
expressivity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Quantum Attention
Higher-Order Interactions
Data Re-uploading
Expressivity Separation
Barren Plateau