🤖 AI Summary
Transformers struggle to simultaneously achieve fine-grained positional modeling and flexible multi-head attention (MHA) for long-range dependencies: existing approaches either decouple semantic and positional encoding or impose uniform positional biases across all heads, compromising representational capacity. This paper proposes ComplexFormer, which introduces head-specific Complex-valued Multi-Head Attention (CMHA), unifying token interactions as rotation and scaling operations in the complex plane. Its core innovations are head-wise Euler transformations and an adaptive differential rotation mechanism, enabling dynamic, heterogeneous fusion of semantic angular differences and relative positional encodings. Integrated with complex-valued neural networks, polar-coordinate projection, and rotation-based gating, ComplexFormer achieves significantly lower perplexity on language modeling, code generation, and mathematical reasoning tasks, improves long-context coherence, and attains superior parameter efficiency compared to strong baselines such as RoPE-Transformer.
📝 Abstract
Transformer models rely on self-attention to capture token dependencies but face challenges in effectively integrating positional information while preserving multi-head attention (MHA) flexibility. Prior methods often model semantic and positional differences disparately or apply uniform positional adjustments across heads, potentially limiting representational capacity. This paper introduces ComplexFormer, featuring Complex Multi-Head Attention (CMHA). CMHA empowers each head to independently model semantic and positional differences, unified within the complex plane and represented as rotation and scaling. ComplexFormer incorporates two key improvements: (1) a per-head Euler transformation, converting real-valued query/key projections into polar-form complex vectors for head-specific complex-subspace operation; and (2) a per-head adaptive differential rotation mechanism, exp[i(Adapt(AS_mn,i) + ΔP_mn,i)], allowing each head to learn distinct strategies for integrating semantic angle differences (AS_mn,i) with relative positional encodings (ΔP_mn,i). Extensive experiments on language modeling, text generation, code generation, and mathematical reasoning show that ComplexFormer achieves superior performance, significantly lower generation perplexity, and improved long-context coherence compared to strong baselines such as RoPE-Transformers. ComplexFormer also demonstrates strong parameter efficiency, offering a more expressive, adaptable attention mechanism.
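The abstract's two ingredients, a per-head Euler transformation into polar-form complex vectors and a rotation exp[i(Adapt(AS_mn,i) + ΔP_mn,i)] fusing semantic and positional angles, can be sketched for a single head as below. This is a minimal illustration based only on the abstract, not the paper's implementation: the pairing of adjacent dimensions into complex numbers, the RoPE-like relative rotation `freq * (m - n)`, and the scalar gate `alpha` standing in for the Adapt(·) function are all assumptions for illustration.

```python
import numpy as np

def cmha_head_scores(q, k, positions, freq, alpha):
    """Sketch of one CMHA-style head's attention scores.

    q, k: (seq, d) real-valued query/key projections.
    positions: (seq,) token positions.
    freq, alpha: hypothetical per-head learned scalars (rotation
    frequency and a gate standing in for the Adapt(.) function).
    """
    # Euler transformation: view adjacent dimension pairs as complex
    # numbers, i.e. a polar-form (magnitude, angle) representation.
    qc = q[:, 0::2] + 1j * q[:, 1::2]          # (seq, d/2)
    kc = k[:, 0::2] + 1j * k[:, 1::2]

    # Semantic angle difference AS_mn per dim pair: angle(q_m) - angle(k_n).
    sem = np.angle(qc)[:, None, :] - np.angle(kc)[None, :, :]

    # Relative positional rotation ΔP_mn = freq * (m - n), RoPE-like.
    rel = positions[:, None] - positions[None, :]    # (seq, seq)
    pos = freq * rel[:, :, None]

    # Adaptive differential rotation: gate the semantic angle with alpha,
    # add the positional angle, and score by magnitude-weighted cosine,
    # i.e. the real part of |q||k| * exp[i(alpha*AS_mn + ΔP_mn)].
    mag = np.abs(qc)[:, None, :] * np.abs(kc)[None, :, :]
    scores = (mag * np.cos(alpha * sem + pos)).sum(-1)
    return scores / np.sqrt(q.shape[-1] // 2)
```

With `alpha = 0` and unit magnitudes this degenerates to a purely positional rotation, while `freq = 0` leaves a purely semantic angular score, which is one way to read the claim that each head learns its own mixing strategy.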