Mean-Pooled Cosine Similarity is Not Length-Invariant: Theory and Cross-Domain Evidence for a Length-Invariant Alternative

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a flaw in using mean-pooled cosine similarity to compare cross-lingual and cross-modal representations: the metric is not length-invariant, so scores can reflect sequence length rather than representational content and lead to misleading conclusions. The authors give a theoretical account of why, under anisotropic representations, mean-pooled cosine grows monotonically with sequence length, and they recommend length-invariant alternatives such as Centered Kernel Alignment (CKA). Experiments cover four code LLMs on HumanEvalPack, Mistral-7B on parallel WMT pairs, and CLIP ViT-B/32. On HumanEvalPack, the length ratio alone explains 52%–75% of the variance in cosine-based cross-language similarity, while AST depth and shared-token fraction add less than 3%; switching to CKA cuts the explained variance by 83%. On WMT, length explains 23% (EN-FR) and 33% (EN-DE) of the variance under cosine but less than 1% under CKA.

📝 Abstract

Mean-pooled cosine similarity is the default metric for comparing neural representations across languages, modalities, and tasks. We establish that this metric is not length-invariant: under the anisotropy that characterizes modern transformer representations, mean-pooled cosine grows monotonically in sequence length, independent of representational content. Empirically, on HumanEvalPack across four code LLMs, the length ratio alone explains $R^2 = 0.52$--$0.75$ of cross-language "Python proximity," while AST depth and shared-token fraction add less than 3% of explained variance beyond length. Substituting Centered Kernel Alignment (CKA) reduces explained variance by 83% and reverses the sign of the length coefficient ($\beta_{\mathrm{len}}: +0.86 \to -0.37$). The same pattern holds in Mistral-7B on parallel WMT pairs ($R^2 = 0.23$ EN-FR, $R^2 = 0.33$ EN-DE for cosine; $R^2 < 0.01$ for CKA). In CLIP ViT-B/32, mean-pooling reduces the length effect relative to EOS-pooling ($R^2: 0.21 \to {<}0.01$), as predicted by the theory's dependence on anisotropy. We argue that length-invariant metrics such as CKA should be the default for cross-representation comparisons, and that recent claims of cross-lingual representational convergence built on mean-pooled cosine warrant re-examination.
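
The length effect described in the abstract can be illustrated with a toy model; this is a sketch under stated assumptions, not the paper's derivation or experimental setup. Assume each token embedding is a shared mean vector $\mu$ (a stand-in for anisotropy) plus i.i.d. Gaussian noise with variance $\sigma^2$ per dimension. Mean-pooling $n$ tokens shrinks the noise by $1/\sqrt{n}$, so the cosine between two unrelated sequences is roughly $\|\mu\|^2 / (\|\mu\|^2 + \sigma^2 d / n)$, which rises with $n$ even though the sequences share no content. The function names, dimensions, and noise model below are all hypothetical; `linear_cka` uses the standard linear-kernel form of CKA, whose column centering removes a shared offset.

```python
import numpy as np


def mean_pooled_cosine(a, b):
    """Cosine similarity between the mean-pooled rows of a and b (tokens x dim)."""
    u, v = a.mean(axis=0), b.mean(axis=0)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def linear_cka(x, y):
    """Linear CKA between two (examples x dim) matrices with matched rows."""
    # Column centering removes any offset shared by all examples.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    return float(cross / (np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro")))


def toy_tokens(rng, n_tokens, dim, shared_mean, noise=1.0):
    """Toy anisotropic tokens: a shared mean direction plus isotropic noise (assumption)."""
    return shared_mean + noise * rng.standard_normal((n_tokens, dim))


def average_cosine_by_length(lengths, dim=64, trials=200, seed=0):
    """Average mean-pooled cosine between pairs of unrelated toy sequences, per length."""
    rng = np.random.default_rng(seed)
    shared_mean = np.full(dim, 0.5)  # hypothetical anisotropy direction
    result = {}
    for n in lengths:
        sims = [
            mean_pooled_cosine(
                toy_tokens(rng, n, dim, shared_mean),
                toy_tokens(rng, n, dim, shared_mean),
            )
            for _ in range(trials)
        ]
        result[n] = float(np.mean(sims))
    return result


if __name__ == "__main__":
    # Average cosine rises with length although the paired sequences are unrelated.
    for n, sim in average_cosine_by_length([4, 16, 64]).items():
        print(f"length={n:3d}  mean-pooled cosine={sim:.3f}")
```

In this toy setting the similarity is driven entirely by the shared mean and the sequence length. Note that CKA compares matched sets of examples rather than a single pair of pooled vectors, so it is computed over a dataset of paired representations and not per sequence pair.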

Problem

Research questions and friction points this paper is trying to address.

length-invariance

mean-pooled cosine similarity

neural representation comparison

anisotropy

cross-lingual similarity

Innovation

Methods, ideas, or system contributions that make the work stand out.

length-invariance

mean-pooled cosine similarity

Centered Kernel Alignment