Tokenization Constraints in LLMs: A Study of Symbolic and Arithmetic Reasoning Limits

📅 2025-05-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies tokenization as a critical bottleneck limiting large language models’ (LLMs) symbolic and arithmetic reasoning capabilities. We find that mainstream subword tokenization schemes—such as Byte-Pair Encoding (BPE)—fragment atomic semantic units, disrupting logical alignment and undermining chain-of-thought (CoT) prompting. To address this, we propose the “Token Awareness” theoretical framework, which formally characterizes the alignment between token granularity and task structure. Through controlled tokenization manipulation experiments and systematic cross-model evaluation (e.g., GPT-4o-mini vs. o1) on arithmetic and symbolic reasoning benchmarks, we demonstrate that atomically aligned input representations significantly improve generalization—enabling smaller models to outperform larger ones. Our results reveal that symbolic reasoning performance is not solely determined by model scale or architecture, but is fundamentally constrained by the representational design at the tokenization level.

📝 Abstract
Tokenization is the first, and often underappreciated, layer of computation in language models. While Chain-of-Thought (CoT) prompting enables transformer models to approximate recurrent computation by externalizing intermediate steps, we show that the success of such reasoning is fundamentally bounded by the structure of tokenized inputs. This work presents a theoretical and empirical investigation into how tokenization schemes, particularly subword-based methods like byte-pair encoding (BPE), impede symbolic computation by merging or obscuring atomic reasoning units. We introduce the notion of Token Awareness to formalize how poor token granularity disrupts logical alignment and prevents models from generalizing symbolic procedures. Through systematic evaluation on arithmetic and symbolic tasks, we demonstrate that token structure dramatically affects reasoning performance, causing failures even with CoT, while atomically-aligned formats unlock strong generalization, allowing small models (e.g., GPT-4o-mini) to outperform larger systems (e.g., o1) in structured reasoning. Our findings reveal that symbolic reasoning ability in LLMs is not purely architectural, but deeply conditioned on token-level representations.
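The fragmentation effect the abstract describes can be illustrated with a toy greedy longest-match tokenizer (a sketch, not the paper's code; the vocabulary below is hypothetical). Two numbers sharing a prefix can receive different segmentations, so digit positions no longer map consistently to token positions, whereas digit-level tokenization keeps one token per atomic unit:

```python
# Toy greedy longest-match tokenizer over a hypothetical merged vocabulary,
# illustrating how BPE-style merges fragment numbers inconsistently.
TOY_VOCAB = {"123", "45", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9"}

def greedy_tokenize(text: str, vocab: set[str], max_len: int = 3) -> list[str]:
    """Match the longest vocabulary entry at each position (BPE-like behavior)."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            raise ValueError(f"no token covers position {i}")
    return tokens

def digit_tokenize(text: str) -> list[str]:
    """Atomically aligned: one token per digit."""
    return list(text)

print(greedy_tokenize("12345", TOY_VOCAB))  # ['123', '45']
print(greedy_tokenize("12346", TOY_VOCAB))  # ['123', '4', '6'] -- same prefix, different boundaries
print(digit_tokenize("12345"))              # ['1', '2', '3', '4', '5']
```

Under the merged vocabulary, `12345` and `12346` get token boundaries at different offsets, so a model cannot rely on token position to track place value; the digit-level view keeps the alignment stable.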
Problem

Research questions and friction points this paper is trying to address.

Tokenization limits symbolic reasoning in LLMs
Subword tokenization disrupts logical alignment in models
Token structure affects reasoning performance and generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token Awareness formalizes token granularity impact
Subword tokenization disrupts symbolic computation
Atomically-aligned formats enhance reasoning performance
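One way to obtain the atomically-aligned formats described above is to rewrite the input so each digit stands alone, since most subword tokenizers then emit one token per digit. The helper below is a hypothetical illustration of such a preprocessing step, not an implementation from the paper:

```python
import re

def atomize_numbers(prompt: str) -> str:
    """Rewrite each digit run as space-separated digits, e.g. '123' -> '1 2 3',
    so subword tokenizers are likely to assign one token per digit."""
    return re.sub(r"\d+", lambda m: " ".join(m.group()), prompt)

print(atomize_numbers("Compute 123 + 456"))  # Compute 1 2 3 + 4 5 6
```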