🤖 AI Summary
It remains unclear whether self-attention in Transformers can perform logical reasoning on its own or requires feed-forward layers to do so. Method: We construct a single-layer, hand-designed Transformer encoder and empirically evaluate it on an adjacent part-of-speech pair prediction task. Contribution/Results: Our study provides empirical evidence that self-attention alone suffices for basic logical inference and syntactic relation modeling, without any feed-forward layers. Further analysis characterizes the logical capacity limits of self-attention and identifies harmful zero-gradient regions where gradient descent stagnates. Based on this, we propose an explicit zero-point analysis and avoidance strategy, significantly improving training stability and generalization. This work delivers evidence and interpretable tools for understanding the intrinsic logical reasoning mechanisms within self-attention.
📝 Abstract
The Transformer architecture applies self-attention to tokens represented as vectors, followed by a fully connected (neural network) layer; these two parts can be stacked many times. Traditionally, self-attention is seen as a mechanism for aggregating information before logical operations are performed by the fully connected layer. In this paper, we show that, quite counter-intuitively, the logical analysis can also be performed within self-attention. To this end, we implement a handcrafted single-layer encoder that performs the logical analysis within self-attention. We then study the scenario in which a single-layer Transformer model is trained with gradient descent, and we investigate whether the model uses fully connected layers or self-attention for logical analysis when it has the choice. Since gradient descent can become stuck at undesired zeros, we explicitly calculate these unwanted zeros and find ways to avoid them. We do all this in the context of predicting the grammatical category pairs of adjacent tokens in a text. We believe that our findings have broader implications for understanding the potential logical operations performed by self-attention.
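To make the core idea concrete, here is a minimal sketch (our own construction, not the paper's actual model) of how a single hand-designed attention head, with no feed-forward layer, can encode the adjacent part-of-speech pair task: each position attends hard to the previous token and copies its POS feature next to its own, so the layer's output at position i jointly represents the pair (POS of token i-1, POS of token i). The POS inventory, helper names, and hard-attention pattern below are illustrative assumptions.

```python
import numpy as np

# Toy POS inventory (illustrative assumption, not from the paper).
POS = ["DET", "NOUN", "VERB"]
d = len(POS)

def one_hot(idx, n):
    v = np.zeros(n)
    v[idx] = 1.0
    return v

def attn_only_layer(X):
    """X: (T, d) rows of one-hot POS features.
    Returns (T, 2d): [own POS | previous token's POS], computed purely
    by attention -- no feed-forward layer."""
    T = X.shape[0]
    # Hand-designed hard attention: position i attends entirely to i-1
    # (position 0 attends to itself). This stands in for softmax(QK^T)
    # with query/key projections hand-tuned to saturation.
    A = np.zeros((T, T))
    A[0, 0] = 1.0
    for i in range(1, T):
        A[i, i - 1] = 1.0
    attended = A @ X  # each row now holds the previous token's POS
    # The residual stream keeps the token's own POS in the first half;
    # the attended POS is written into the second half.
    return np.concatenate([X, attended], axis=1)

# "the cat sleeps" -> DET NOUN VERB
sent = [0, 1, 2]
X = np.stack([one_hot(i, d) for i in sent])
Y = attn_only_layer(X)
# Row 2 encodes the pair (NOUN, VERB): own POS = VERB, attended POS = NOUN.
```

The output already determines the adjacent POS pair as a conjunction of two features, which is the kind of logical operation the paper argues attention can perform without handing it off to the fully connected layer.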