🤖 AI Summary
Whether layer-wise activations across large language models (LLMs) with heterogeneous architectures share alignable representational geometry remains unclear.
Method: We propose a systematic framework based on nearest-neighbor graphs and high-dimensional geometric similarity measures to enable cross-model layer alignment and depth-normalized comparison across 24 open-source LLMs.
Contribution/Results: We discover, for the first time, that activation spaces at matched normalized depths exhibit highly consistent local neighborhood structures, forming robust, layerwise-evolving geometric patterns. Crucially, normalized depth, not absolute layer index, predicts cross-model activation similarity: nearest-neighbor matching accuracy at equivalent depths significantly exceeds both random baselines and cross-depth controls. This reveals an implicit, shared computational pathway across diverse LLMs, establishing a geometric foundation for model alignment, knowledge transfer, and interpretability research.
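The core comparison described above can be sketched in a few lines: build k-nearest-neighbor sets from each model's activations over a shared set of inputs, map layers to a common normalized-depth scale, and score how well neighbor sets agree at matched depths versus a random baseline. This is a minimal illustrative sketch, not the paper's implementation; the function names (`knn_sets`, `knn_overlap`, `matched_layer`), the choice of cosine similarity, and the Jaccard overlap score are all assumptions for the example.

```python
import numpy as np

def knn_sets(acts, k=10):
    # acts: (n_samples, d) activations of one layer over a shared input set.
    # Returns each sample's k-nearest-neighbor index set under cosine similarity.
    normed = acts / np.linalg.norm(acts, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)  # exclude self-matches
    idx = np.argsort(-sims, axis=1)[:, :k]
    return [set(row) for row in idx]

def knn_overlap(acts_a, acts_b, k=10):
    # Mean Jaccard overlap between the neighbor sets induced by two layers,
    # possibly from different models with different hidden dimensions.
    sets_a, sets_b = knn_sets(acts_a, k), knn_sets(acts_b, k)
    return float(np.mean([len(a & b) / len(a | b) for a, b in zip(sets_a, sets_b)]))

def matched_layer(depth_frac, n_layers):
    # Map a normalized depth in [0, 1] to a layer index, so models with
    # different layer counts can be compared at "the same" relative depth.
    return round(depth_frac * (n_layers - 1))

# Toy demo: two "models" whose activations share latent structure should show
# higher neighbor overlap than a layer paired with unrelated random activations.
rng = np.random.default_rng(0)
shared = rng.normal(size=(50, 16))                    # shared latent structure
model_a = shared + 0.05 * rng.normal(size=(50, 16))   # model A's layer (16-d)
model_b = shared @ rng.normal(size=(16, 32))          # model B's layer (32-d)
unrelated = rng.normal(size=(50, 32))                 # random baseline
```

The Jaccard overlap is one of several reasonable neighbor-agreement scores; since it compares only neighbor *index sets*, it works across models with different hidden sizes without any learned alignment map.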
📝 Abstract
How do the latent spaces used by independently trained LLMs relate to one another? We study the nearest neighbor relationships induced by activations at different layers of 24 open-weight LLMs, and find that they 1) tend to vary from layer to layer within a model, and 2) are approximately shared between corresponding layers of different models. Claim 2 shows that these nearest neighbor relationships are not arbitrary, as they are shared across models, but Claim 1 shows that they are not "obvious" either, as there is no single set of nearest neighbor relationships that is universally shared. Together, these suggest that LLMs generate a progression of activation geometries from layer to layer, but that this entire progression is largely shared between models, stretched and squeezed to fit into different architectures.