🤖 AI Summary
Standard self-attention suffers from a low-rank bottleneck, limiting its capacity to model multi-hop dependencies within a single layer. To address this, we propose the Higher-order Attention Network (Hon), which introduces, for the first time, a nested attention recurrence mechanism: queries and keys are recursively updated, and multiple rounds of self-attention are dynamically executed within one layer to capture higher-order relational structures. Hon employs parameter sharing and dynamic nesting, incurring only *O*(1) additional parameters while theoretically breaking the linear-rank constraint inherent in standard attention. Empirical evaluation across multiple benchmark tasks demonstrates that Hon consistently outperforms standard Transformers, achieving superior modeling capacity and generalization performance without increasing computational overhead.
📝 Abstract
Transformers have achieved significant success across various domains, relying on self-attention to capture dependencies. However, the standard first-order attention mechanism is often limited by a low-rank bottleneck, struggling to capture intricate, multi-hop relationships within a single layer. In this paper, we propose the **Higher-Order Attention Network (Hon)**, a novel architecture designed to enhance representational power through a recursive framework. Unlike standard approaches that use static linear projections for Queries and Keys, Hon dynamically refines these representations via nested self-attention mechanisms. Specifically, the Query and Key vectors are themselves outputs of inner attention loops, allowing tokens to aggregate global context and model high-order correlations *prior* to the final attention computation. We enforce a parameter-efficient weight-sharing strategy across recursive steps, ensuring that this enhanced expressivity incurs $\mathcal{O}(1)$ additional parameters. We provide theoretical analysis demonstrating that our method breaks the linear bottleneck of standard attention. Empirically, Hon outperforms standard Transformers on multiple benchmarks.
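To make the nested recurrence concrete, here is a minimal NumPy sketch of the idea described above: Queries and Keys are themselves refined by inner rounds of attention before the final attention computation, with a single shared set of projection matrices so the extra parameter cost is constant. The function names (`higher_order_attention`), the `order` hyperparameter, and the specific update rule for the inner loops are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Standard scaled dot-product attention for a single sequence."""
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def higher_order_attention(x, Wq, Wk, Wv, order=2):
    """Hedged sketch of a nested attention recurrence.

    Q and K start as the usual linear projections, then (order - 1)
    inner attention rounds refine them with global context. The same
    Wq, Wk, Wv are reused at every step (weight sharing), so the
    parameter count matches a standard attention layer.
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    for _ in range(order - 1):
        # Inner loops: let queries and keys attend over the sequence
        # before the final attention computation (assumed update rule).
        q = attention(q, k, v)
        k = attention(k, q, v)
    return attention(q, k, v)

# Tiny usage example: a sequence of 4 tokens with dimension 8.
rng = np.random.default_rng(0)
n, d = 4, 8
x = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
out = higher_order_attention(x, Wq, Wk, Wv, order=3)
```

With `order=1` the inner loop is skipped and the function reduces exactly to standard first-order attention, which makes the construction a strict generalization.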