🤖 AI Summary
This study systematically investigates large language models' (LLMs) ability to model continuous numerical values, revealing inherent discontinuity and high noise in their embedding spaces: while integers can be reconstructed with high fidelity (R² ≥ 0.95), principal components explain only a small fraction of the embedding variance, and performance degrades markedly with increasing decimal precision. Using linear reconstruction, principal component analysis (PCA), and embedding-variance analysis across embedding models from three providers (OpenAI, Google Gemini, and Voyage AI), the authors show that numbers are not represented continuously but rather encoded discretely, with inefficient use of the embedding dimensions. The work is presented as the first systematic demonstration of this limitation in LLM numerical embeddings. It provides evidence of embedding-mechanism deficiencies for high-precision numerical tasks, such as scientific computing and financial modeling, and identifies concrete directions for architectural and representational improvement.
📝 Abstract
Recent research has extensively studied how large language models (LLMs) manipulate integers in specific arithmetic tasks and, on a more fundamental level, how they represent numeric values. These previous works have found that language model embeddings can be used to reconstruct the original values; however, they do not evaluate whether language models actually model continuous values as continuous. Using expected properties of the embedding space, including linear reconstruction and principal component analysis (PCA), we show that language models not only represent numeric spaces as non-continuous but also introduce significant noise. Using models from three major providers (OpenAI, Google Gemini, and Voyage AI), we show that while reconstruction is possible with high fidelity ($R^2 \geq 0.95$), principal components explain only a minor share of the variation within the embedding space. This indicates that many components of the embedding space are orthogonal to the simple numeric input space. Further, both linear reconstruction and explained variance degrade with increasing decimal precision, even though the ordinal nature of the input space is fundamentally unchanged. These findings therefore have implications for the many areas where embedding models are used, in particular where high numerical precision, large magnitudes, or mixed-sign values are common.
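The two probes named in the abstract, linear reconstruction and PCA explained variance, can be sketched as follows. This is a hypothetical illustration on synthetic "embeddings" (a single linear numeric direction buried in high-dimensional noise), not the paper's actual data, models, or code: it only shows how high reconstruction $R^2$ can coexist with a top principal component that explains little of the total variance.

```python
import numpy as np

# Synthetic setup (illustrative assumption): embeddings carry the number
# along one fixed unit direction, plus isotropic Gaussian noise.
rng = np.random.default_rng(0)
numbers = np.linspace(0.0, 100.0, 1000)            # numeric inputs
z = (numbers - numbers.mean()) / numbers.std()     # standardized values
d = 256                                            # embedding dimension
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)             # unit "numeric" direction
emb = 5.0 * np.outer(z, direction) + rng.normal(size=(numbers.size, d))

# (a) Linear reconstruction: least-squares fit from embeddings to numbers,
# scored with the coefficient of determination R^2.
X = np.hstack([emb, np.ones((numbers.size, 1))])   # add intercept column
coef, *_ = np.linalg.lstsq(X, numbers, rcond=None)
resid = numbers - X @ coef
r2 = 1.0 - resid.var() / numbers.var()

# (b) PCA explained-variance ratios via SVD of the centered embeddings.
centered = emb - emb.mean(axis=0)
s = np.linalg.svd(centered, compute_uv=False)      # singular values
explained = s**2 / np.sum(s**2)

print(f"R^2 = {r2:.3f}")                     # high-fidelity reconstruction
print(f"top PC share = {explained[0]:.3f}")  # small fraction of total variance
```

In this toy setting the regression recovers the numeric direction almost perfectly (R² above 0.95), yet the leading principal component accounts for well under 20% of the variance, mirroring the dissociation the abstract reports for real embedding models.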