Language Models Compare Quantities Using Number-specific and Unit-specific Heuristics

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates how language models compare numerical quantities expressed in different units (e.g., 110 cm vs. 1.2 m), and why their accuracy becomes fragile near the comparison boundary, where small changes in value determine the correct answer. Using controlled experiments with linear surrogate models and causal interventions on subspaces aligned with the relevant variables, the authors find evidence that models do not first normalize both quantities to a shared scale. Instead, the models appear to rely on heuristics that process the numerical difference and the unit-scale difference as separate cues. The resulting errors are systematic and predictable, which challenges the assumption that language models perform explicit unit conversion before comparing quantities. The work offers insight into the internal mechanisms behind quantitative comparison in language models.

📝 Abstract

Quantities with measurement units, such as 110 cm and 1.2 m, require language models (LMs) to combine a numeral with a symbolic unit scale. Here, we study how LMs compare such quantities in controlled settings spanning several unit systems. We find that accuracy degrades near the comparison boundary, where small changes in value determine the correct answer. The resulting errors are systematic: linear surrogate models predict LM preferences from numerical-difference and unit-scale-difference cues, and causal interventions on subspaces aligned with these variables shift the model's output. The results suggest that LMs compare quantities through a bag of heuristics over numerals and units, rather than first converting both expressions to an exact shared-scale representation.
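To make the contrast in the abstract concrete, the following is a minimal sketch of how a linear surrogate over two separate cues can diverge from exact unit normalization near the comparison boundary. It is not the authors' model: the log-scale features, the unit table `UNIT_SCALE_M`, the function names, and the weights `w_num` and `w_unit` are all illustrative assumptions.

```python
import math

# Metres per unit. An illustrative table, not the paper's unit inventory.
UNIT_SCALE_M = {"mm": 1e-3, "cm": 1e-2, "m": 1.0, "km": 1e3}


def exact_first_larger(a_val, a_unit, b_val, b_unit):
    """Reference behaviour: convert both quantities to metres, then compare."""
    return a_val * UNIT_SCALE_M[a_unit] > b_val * UNIT_SCALE_M[b_unit]


def heuristic_score(a_val, a_unit, b_val, b_unit, w_num=1.0, w_unit=0.9):
    """Hypothetical linear surrogate: a weighted sum of two separate cues.

    num_diff  : difference of the numerals on a log10 scale (assumed feature)
    unit_diff : difference of the unit scales on a log10 scale (assumed feature)
    The default weights are made up for illustration only.
    """
    num_diff = math.log10(a_val) - math.log10(b_val)
    unit_diff = math.log10(UNIT_SCALE_M[a_unit]) - math.log10(UNIT_SCALE_M[b_unit])
    return w_num * num_diff + w_unit * unit_diff


def heuristic_first_larger(a_val, a_unit, b_val, b_unit, w_num=1.0, w_unit=0.9):
    """Predict 'first quantity is larger' when the surrogate score is positive."""
    return heuristic_score(a_val, a_unit, b_val, b_unit, w_num, w_unit) > 0
```

Under these assumed weights, the surrogate judges 110 cm to be larger than 1.2 m, which is wrong, but correctly judges 110 cm to be smaller than 5 m, so the error appears only close to the boundary. When `w_num` and `w_unit` are equal, the score is proportional to the log of the true ratio and the surrogate agrees with exact conversion; in this sketch the boundary errors come from the imbalance between the two cue weights.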

Problem

Research questions and friction points this paper is trying to address.

language models

quantity comparison

measurement units

numerical reasoning

heuristics

Innovation

Methods, ideas, or system contributions that make the work stand out.

language models

quantity comparison

heuristics