The Geometry of LLM-as-Judge: Why Inter-LLM Consensus Is Not Human Alignment

📅 2026-06-01
📈 Citations: 0
Influential: 0

🤖 AI Summary
This study examines why large language models (LLMs) used as evaluators diverge from human preferences even when they agree strongly with one another. The authors introduce a geometric analysis framework and measure how well LLM judgment subspaces align with human judgments across four Indic datasets, eight Indic languages, and 41 LLM judges, using score spread, effective rank, principal angles, and stacked correlations, all with bootstrap confidence intervals. On subjective tasks, LLM evaluation axes are nearly orthogonal to human judgments, and the judges' mutual agreement reflects shared bias rather than genuine alignment. Only post-hoc calibration on a small human-anchored set improves all four community-health rubrics together, allowing a calibrated 24B Indic judge to surpass GPT-5.5 while still falling short of human reliability. The work argues that inter-LLM consensus should count as evidence of alignment only when a direct geometric check on the judge's score subspace passes.
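The bootstrap confidence intervals mentioned in the summary can be illustrated with a minimal sketch. It assumes a percentile bootstrap that resamples items with replacement and recomputes a Pearson correlation each time; the source does not say which bootstrap variant the authors use, and the function name `bootstrap_pearson_ci` and its parameters are hypothetical.

```python
import numpy as np


def bootstrap_pearson_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for Pearson r (illustrative assumption,
    not necessarily the paper's procedure).

    x, y: paired 1-D score vectors over the same items.
    Returns (point_estimate, lower, upper).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(x)
    # Each row is one resample of item indices, drawn with replacement.
    idx = rng.integers(0, n, size=(n_boot, n))
    rs = np.array([np.corrcoef(x[i], y[i])[0, 1] for i in idx])
    lower, upper = np.quantile(rs, [alpha / 2, 1 - alpha / 2])
    point = np.corrcoef(x, y)[0, 1]
    return float(point), float(lower), float(upper)
```

Resampling whole items keeps each judge score paired with its human score, so the interval reflects item-level sampling variability. A resample with constant scores would give an undefined correlation; this sketch does not guard against that case.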
📝 Abstract
LLMs-as-judges are now standard, yet judges agree strongly with one another while agreeing only weakly with humans. We test whether this reflects shared signal or shared bias by measuring four geometric quantities on the standard LLM-as-judge stack across four community-built Indic datasets, eight Indic languages, and 41 LLM judges: score spread, effective rank, principal angle to the human subspace, and stacked correlations among judges and humans, all with bootstrap confidence intervals. On subjective rubrics, judges use less than half the human score range ($\sigma_J / \sigma_H \approx 0.3$--$0.5$). Their evaluation axis is nearly orthogonal to the human one and noticeably further from humans than humans are from each other ($87^\circ$--$89^\circ$ versus $78^\circ$--$81^\circ$). Inter-LLM agreement exceeds LLM--human agreement ($r_{LL} \approx 0.35$ versus $r_{LH} \approx 0.27$--$0.32$). On a rubric with a verifiable factual answer, the same diagnostics fall back into the human range (axis $58.5^\circ$; $r_{LH} = 0.519$). Fine-tuning and preference optimization recover spread ($0.32 \rightarrow 1.08$) but barely move the axis (still $87^\circ$--$88^\circ$). Only post-hoc calibration on a small human-anchored set improves all four community-health rubrics together, placing a calibrated 24B Indic judge ($r = 0.184$) ahead of GPT-5.5 ($r = 0.123$), yet still short of human reliability (human-human $r = 0.474$ on the verifiable rubric). We argue that inter-LLM agreement should be considered evidence of human alignment only when a direct geometric check on the judge's score subspace passes; otherwise, the consensus reflects agreement within a collapsed subspace.
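Three of the geometric quantities named in the abstract can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' code: score matrices are taken to have shape `(n_items, n_raters)` with full column rank, effective rank is taken to be the exponential of the entropy of the normalized singular values, and the reported angle is taken to be the smallest principal angle between the column spaces of the centered judge and human matrices. All function names are hypothetical.

```python
import numpy as np


def spread_ratio(judge_scores, human_scores):
    """sigma_J / sigma_H: how much of the human score range judges use."""
    return float(np.std(judge_scores) / np.std(human_scores))


def effective_rank(scores):
    """exp(entropy of normalized singular values) of the centered matrix.

    Assumed definition; the source does not state which one is used.
    """
    centered = scores - scores.mean(axis=0, keepdims=True)
    s = np.linalg.svd(centered, compute_uv=False)
    if s.sum() == 0:
        return 0.0  # no variance at all
    p = s / s.sum()
    p = p[p > 0]  # drop exact zeros so log is defined
    return float(np.exp(-(p * np.log(p)).sum()))


def smallest_principal_angle_deg(a, b):
    """Smallest principal angle (degrees) between the column spaces
    of two centered score matrices with the same number of rows."""
    qa, _ = np.linalg.qr(a - a.mean(axis=0, keepdims=True))
    qb, _ = np.linalg.qr(b - b.mean(axis=0, keepdims=True))
    # Singular values of Qa^T Qb are the cosines of the principal angles.
    cosines = np.linalg.svd(qa.T @ qb, compute_uv=False)
    cos_max = np.clip(cosines.max(), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_max)))
```

Under these definitions, a spread ratio near 0.3 to 0.5 corresponds to judges compressing their scores, and an angle near $90^\circ$ means the judges' score vectors carry almost no component along any direction spanned by the human raters.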
Problem

Research questions and friction points this paper is trying to address.

LLM-as-judge
human alignment
inter-LLM consensus
evaluation bias
geometric analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-as-Judge
geometric alignment
human disagreement
subspace analysis
post-hoc calibration
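
The post-hoc calibration step described in the abstract could look roughly like the sketch below. The source does not specify the calibration method, so this assumes the simplest option: a least-squares affine map fitted on a small set of items that have both judge and human scores, then applied to new judge scores. The function names and the 1 to 5 rating range are hypothetical.

```python
import numpy as np


def fit_affine_calibration(judge_anchor, human_anchor):
    """Fit human ~ slope * judge + intercept on the human-anchored set.

    Assumed method for illustration; the paper's calibration may differ.
    """
    slope, intercept = np.polyfit(
        np.asarray(judge_anchor, dtype=float),
        np.asarray(human_anchor, dtype=float),
        deg=1,
    )
    return float(slope), float(intercept)


def apply_calibration(judge_scores, slope, intercept, lo=1.0, hi=5.0):
    """Map raw judge scores onto the human scale, clipped to [lo, hi]."""
    mapped = slope * np.asarray(judge_scores, dtype=float) + intercept
    return np.clip(mapped, lo, hi)
```

A single positive-slope affine map restores scale and offset but leaves the Pearson correlation of the mapped scores unchanged, so this sketch illustrates only the fit-on-anchors, apply-elsewhere pattern.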