🤖 AI Summary
Standard self-attention suffers from a low-rank bottleneck, limiting its capacity to model multi-hop dependencies within a single layer. To address this, we propose the Higher-order Attention Network (Hon), which introduces, for the first time, a nested attention recurrence mechanism: queries and keys are recursively updated, and multiple rounds of self-attention are dynamically executed within one layer to capture higher-order relational structures. Hon employs parameter sharing and dynamic nesting, incurring only *O*(1) additional parameters while theoretically breaking the linear-rank constraint inherent in standard attention. Empirical evaluation across multiple benchmark tasks demonstrates that Hon consistently outperforms standard Transformers, achieving superior modeling capacity and generalization performance without increasing computational overhead.
📝 Abstract
Transformers have achieved significant success across various domains, relying on self-attention to capture dependencies. However, the standard first-order attention mechanism is often limited by a low-rank bottleneck, struggling to capture intricate, multi-hop relationships within a single layer. In this paper, we propose the **Higher-Order Attention Network (Hon)**, a novel architecture designed to enhance representational power through a recursive framework. Unlike standard approaches that use static linear projections for Queries and Keys, Hon dynamically refines these representations via nested self-attention mechanisms. Specifically, the Query and Key vectors are themselves outputs of inner attention loops, allowing tokens to aggregate global context and model high-order correlations *prior* to the final attention computation. We enforce a parameter-efficient weight-sharing strategy across recursive steps, ensuring that this enhanced expressivity incurs $\mathcal{O}(1)$ additional parameters. We provide theoretical analysis demonstrating that our method breaks the linear bottleneck of standard attention. Empirically, Hon outperforms standard Transformers on multiple benchmarks.
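To make the nested recurrence concrete, here is a minimal NumPy sketch of the idea described above: Queries and Keys are themselves refined by inner rounds of attention before the final attention computation, with a single shared set of projection matrices so the extra parameter cost is constant. The function names (`higher_order_attention`), the `order` hyperparameter, and the specific update rule for the inner loops are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Standard scaled dot-product attention for a single sequence."""
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def higher_order_attention(x, Wq, Wk, Wv, order=2):
    """Hedged sketch of a nested attention recurrence.

    Q and K start as the usual linear projections, then (order - 1)
    inner attention rounds refine them with global context. The same
    Wq, Wk, Wv are reused at every step (weight sharing), so the
    parameter count matches a standard attention layer.
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    for _ in range(order - 1):
        # Inner loops: let queries and keys attend over the sequence
        # before the final attention computation (assumed update rule).
        q = attention(q, k, v)
        k = attention(k, q, v)
    return attention(q, k, v)

# Tiny usage example: a sequence of 4 tokens with dimension 8.
rng = np.random.default_rng(0)
n, d = 4, 8
x = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
out = higher_order_attention(x, Wq, Wk, Wv, order=3)
```

With `order=1` the inner loop is skipped and the function reduces exactly to standard first-order attention, which makes the construction a strict generalization.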