More Expressive Attention with Negative Weights

📅 2024-11-11
🏛️ arXiv.org
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Traditional softmax attention enforces non-negative attention weights, which can cause representational collapse by excessively compressing ("over-squashing") earlier tokens into later positions. To address this, the authors propose Cog Attention, a mechanism that relaxes the non-negativity constraint by allowing attention weights to be negative within a single head. By dynamically modulating the sign of query-key (QK) dot products, Cog Attention lets one head express multiple semantic operations, such as deletion and copying, simultaneously and without added architectural overhead. This frees the output-value (OV) projection matrix from encoding those basic token operations, allowing it to focus on feature refinement. The mechanism drops into decoder-only language models and U-ViT diffusion models; experiments show consistent improvements over softmax baselines on both language modeling and image generation, indicating gains in expressiveness and robustness from negative-weight attention.

📝 Abstract
We propose a novel attention mechanism, named Cog Attention, that enables attention weights to be negative for enhanced expressiveness, which stems from two key factors: (1) Cog Attention enhances parameter flexibility. For example, unlike traditional softmax attention heads that use a static output-value (OV) matrix to delete or copy inputs that the heads attend to, Cog Attention naturally learns to use the sign of dynamic query-key (QK) inner products to represent these operations. This enables Cog Attention to perform multiple operations simultaneously within a single head. Meanwhile, Cog Attention's OV matrix can focus more on refinement or modification. (2) Cog Attention enhances the model's robustness against representational collapse by preventing the "over-squashing" of earlier tokens into later positions. We develop Transformer-like models which use Cog Attention as attention modules, including decoder-only models at various scales for language modeling and U-ViT diffusion models for image generation. Experiments show that models using Cog Attention exhibit superior performance compared to those employing traditional softmax attention modules. Our approach suggests a promising research direction for rethinking and breaking the entrenched constraints of traditional softmax attention, such as the requirement for non-negative weights.
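The sign-modulated weighting described in the abstract can be sketched as follows. This is a minimal illustration, assuming weight magnitudes come from a softmax over the absolute QK scores with each score's original sign reapplied afterwards; the paper's exact formulation may differ, and `cog_attention` here is an illustrative name, not the authors' code.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cog_attention(Q, K, V, causal=True):
    """Negative-weight attention sketch (assumption, not the paper's exact
    recipe): normalize |QK scores| with softmax, then reapply the sign of
    each raw score, so a single head can both copy (positive weight) and
    delete (negative weight) the tokens it attends to."""
    d = Q.shape[-1]
    scores = (Q @ K.T) / np.sqrt(d)      # (T, T) raw QK scores
    signs = np.sign(scores)              # +1 / 0 / -1 per score
    mags = np.abs(scores)
    if causal:
        # Mask future positions before normalizing the magnitudes.
        future = np.triu(np.ones_like(mags, dtype=bool), k=1)
        mags = np.where(future, -np.inf, mags)
    weights = signs * softmax(mags)      # entries may be negative
    return weights @ V, weights

# Small deterministic example: one query-key pair points in an
# opposing direction, producing a negative attention weight.
Q = np.array([[1., 0.], [0., 1.], [1., 1.]])
K = np.array([[1., 0.], [-1., 0.], [0., 1.]])
V = np.array([[1., 0.], [0., 1.], [1., 1.]])
out, W = cog_attention(Q, K, V, causal=False)
```

Note that, unlike softmax attention, the rows of `W` are not convex combinations: the absolute weights per row sum to at most 1, and a negative entry subtracts the corresponding value vector, which is what lets the OV matrix stay focused on refinement.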
Problem

Research questions and friction points this paper is trying to address.

Attention Mechanism
Robustness
Negative Weights
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cog Attention
negative attention weights
expressiveness and robustness