🤖 AI Summary
This study investigates whether large language models (LLMs) that differ in architecture, training data, and optimization procedures nonetheless develop similar internal inference patterns. Using interaction-based explanations, the authors compare the interactions that different models rely on when predicting the same target token from the same prompt. They find that LLMs often share interaction patterns, and that this consistency is more pronounced among more advanced models. Shared interactions tend to be lower-order and show weaker positive-negative cancellation than non-shared interactions. The results suggest that advanced LLMs may be implicitly optimized toward common inference patterns, although the mechanisms behind this cross-model consistency remain open.
📝 Abstract
Large language models (LLMs) differ in architecture, training data, and optimization procedures, yet they may still develop similar internal inference patterns. In this paper, we examine this hypothesis using interaction-based explanations. We find that LLMs often share interaction patterns when predicting the same target token from the same prompt. This consistency is more pronounced among advanced LLMs. Shared interactions also tend to be lower-order and show weaker positive-negative cancellation than non-shared interactions. These results suggest that advanced LLMs may be implicitly optimized toward common inference patterns, even though the mechanisms that give rise to such cross-model consistency remain open.
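To make the terms "interaction", "order", and "positive-negative cancellation" concrete, here is a minimal sketch on toy data. The abstract does not specify the interaction measure, so this sketch assumes the Harsanyi dividend, I(S) = Σ_{T⊆S} (−1)^{|S|−|T|} v(T), which is one common choice in interaction-based explanation work. The two toy "models", the threshold `tau`, the Jaccard overlap, and the `cancellation_ratio` helper are all hypothetical illustrations, not the paper's metrics or data.

```python
from itertools import combinations


def subsets(items):
    """Yield every subset of `items` as a frozenset (including the empty set)."""
    items = tuple(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def harsanyi_interactions(v, n):
    """Harsanyi dividend I(S) for every non-empty S of n input variables.

    `v(T)` is the model output when only the variables in T are unmasked.
    Assumption: this is the interaction measure; the abstract does not say.
    """
    return {
        S: sum((-1) ** (len(S) - len(T)) * v(T) for T in subsets(S))
        for S in subsets(range(n))
        if S
    }


def salient(interactions, tau):
    """Sets whose interaction strength exceeds a (hypothetical) threshold."""
    return {S for S, val in interactions.items() if abs(val) > tau}


def jaccard(a, b):
    """Overlap between two collections of salient interactions."""
    return len(a & b) / len(a | b) if (a | b) else 1.0


def cancellation_ratio(values):
    """0.0 when all effects share a sign, 1.0 when they cancel completely."""
    values = list(values)
    total = sum(abs(x) for x in values)
    return 0.0 if total == 0 else 1.0 - abs(sum(values)) / total


# Two toy "models" over three input tokens (indices 0, 1, 2). Invented values.
def model_a(T):
    return 2.0 * (0 in T) + 1.0 * ({0, 1} <= T) - 0.5 * ({0, 1, 2} <= T)


def model_b(T):
    return 1.5 * (0 in T) + 1.25 * ({0, 1} <= T) + 0.5 * ({1, 2} <= T)


ia = harsanyi_interactions(model_a, 3)
ib = harsanyi_interactions(model_b, 3)
sal_a, sal_b = salient(ia, tau=0.1), salient(ib, tau=0.1)
shared = sal_a & sal_b
non_shared = (sal_a | sal_b) - shared

print("shared:", sorted(sorted(s) for s in shared))
print("jaccard:", jaccard(sal_a, sal_b))
print("mean order, shared:", sum(len(s) for s in shared) / len(shared))
print("mean order, non-shared:", sum(len(s) for s in non_shared) / len(non_shared))
print("cancellation (model A, shared):", cancellation_ratio(ia[s] for s in shared))
print("cancellation (model A, all):", cancellation_ratio(ia[s] for s in sal_a))
```

In this toy setup the two models share the order-1 and order-2 interactions on tokens {0} and {0, 1}, giving a Jaccard overlap of 0.5, while the non-shared interactions are of higher order on average. A real analysis would replace `model_a` and `model_b` with masked forward passes of actual LLMs, which this sketch does not attempt.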