🤖 AI Summary
This work identifies a systemic bias in multilingual tokenization: morphologically complex and low-resource languages incur disproportionately higher computational costs and suffer lower accuracy due to token inflation. We term this the *token tax* and use it as a quantitative measure of tokenization inefficiency, empirically demonstrating a strong negative correlation between tokenization ratio (tokens per word) and model accuracy. Using multiple-choice question-answering experiments across 16 African languages and 5 domains (AfriMMLU) with 10 large language models, together with economic cost modeling, we show that doubling token count quadruples training cost and time. Key contributions are: (1) establishing tokenization efficiency as a central determinant of linguistic fairness; (2) demonstrating that inference-time adaptation (e.g., via reasoning models) significantly narrows performance gaps between high- and low-resource languages; and (3) providing empirical evidence and quantitative tools to guide the design of linguistically equitable tokenization mechanisms.
📝 Abstract
Tokenization inefficiency imposes structural disadvantages on morphologically complex, low-resource languages, inflating compute costs and depressing accuracy. We evaluate 10 large language models (LLMs) on AfriMMLU (9,000 MCQA items; 5 subjects; 16 African languages) and show that fertility (tokens per word) reliably predicts accuracy: higher fertility consistently predicts lower accuracy across all models and subjects. We further find that reasoning models (DeepSeek, o1) consistently outperform non-reasoning peers across high- and low-resource languages in the AfriMMLU dataset, narrowing accuracy gaps observed in prior generations. Finally, translating token inflation into economic terms, we show that doubling token count quadruples training cost and time, underscoring the token tax faced by many languages. These results motivate morphologically aware tokenization, fair pricing, and multilingual benchmarks for equitable natural language processing (NLP).
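The two quantities at the heart of the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `fertility` is the stated tokens-per-word ratio, and `relative_training_cost` assumes cost grows with the square of token count, which reproduces the claim that doubling tokens quadruples training cost.

```python
def fertility(num_tokens: int, num_words: int) -> float:
    """Tokenization ratio: tokens emitted per word of source text."""
    return num_tokens / num_words

def relative_training_cost(fertility_ratio: float) -> float:
    """Cost multiplier relative to a baseline tokenizer, assuming
    training cost scales quadratically in token count (an assumption
    consistent with the doubling -> quadrupling claim above)."""
    return fertility_ratio ** 2

# A tokenizer that emits 30 tokens for a 10-word sentence has fertility 3.0.
print(fertility(30, 10))            # → 3.0
# A language tokenized at 2x baseline fertility pays a 4x training cost.
print(relative_training_cost(2.0))  # → 4.0
```

Under this model, a language with fertility 3.0 against an English baseline of 1.5 would face a (3.0/1.5)² = 4× cost multiplier, which is the token tax in monetary terms.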