The Token Tax: Systematic Bias in Multilingual Tokenization

📅 2025-09-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies a systematic bias in multilingual tokenization: morphologically complex and low-resource languages incur disproportionately higher computational costs and suffer lower accuracy due to token inflation. We introduce the *token tax* as a quantitative measure of this inefficiency and empirically demonstrate a strong negative correlation between tokenization ratio (tokens per word) and model accuracy. Using multiple-choice question-answering experiments on AfriMMLU (16 African languages, 5 subjects) with 10 large language models, combined with economic cost modeling, we show that doubling the token count quadruples training cost and time. Key contributions: (1) establishing tokenization efficiency as a central determinant of linguistic fairness; (2) demonstrating that inference-time adaptation (e.g., via reasoning models) significantly narrows performance gaps between high- and low-resource languages; and (3) providing empirical evidence and quantitative tools to guide the design of linguistically equitable tokenizers.

📝 Abstract
Tokenization inefficiency imposes structural disadvantages on morphologically complex, low-resource languages, inflating compute requirements and depressing accuracy. We evaluate 10 large language models (LLMs) on AfriMMLU (9,000 MCQA items; 5 subjects; 16 African languages) and show that fertility (tokens/word) reliably predicts accuracy: higher fertility consistently predicts lower accuracy across all models and subjects. We further find that reasoning models (DeepSeek, o1) consistently outperform non-reasoning peers across high- and low-resource languages in the AfriMMLU dataset, narrowing accuracy gaps observed in prior generations. Finally, translating token inflation into economics, a doubling in tokens results in quadrupled training cost and time, underscoring the token tax faced by many languages. These results motivate morphologically aware tokenization, fair pricing, and multilingual benchmarks for equitable natural language processing (NLP).
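The fertility metric the abstract leans on can be stated in a few lines. The sketch below is illustrative only: the token lists are hand-written stand-ins for real subword tokenizer output, and the whitespace word split is an assumption, not the paper's exact procedure.

```python
# Minimal sketch of the fertility metric (tokens per word).
# Token lists here are hypothetical; a real evaluation would use a
# subword tokenizer (e.g. BPE) and the paper's word-segmentation rules.

def fertility(tokens: list[str], text: str) -> float:
    """Tokens per whitespace-delimited word; higher means more token inflation."""
    words = text.split()
    return len(tokens) / len(words) if words else 0.0

# A language whose words fragment into many subwords pays a higher
# "token tax": more tokens (hence more compute) for the same content.
english_tokens = ["The", "cat", "sat"]                       # 3 tokens, 3 words
agglutinative_tokens = ["ni", "na", "penda", "ku", "soma"]   # 5 tokens, 2 words

print(fertility(english_tokens, "The cat sat"))            # 1.0
print(fertility(agglutinative_tokens, "ninapenda kusoma")) # 2.5
```

Under this measure, the abstract's finding is that per-language accuracy falls as fertility rises.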
Problem

Research questions and friction points this paper is trying to address.

Tokenization inefficiency disadvantages morphologically complex languages
Higher token fertility predicts lower model accuracy
Token inflation quadruples training costs and time
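The cost claim in the last bullet implies training cost grows roughly quadratically in token count (doubling tokens quadruples cost). The function below is a hedged reading of that claim, not the paper's actual cost model; the quadratic form and unit baseline are assumptions for illustration.

```python
# Sketch of the "token tax" cost claim: doubling the token count
# quadruples training cost and time, i.e. cost ~ tokens**2.
# The exponent and baseline are illustrative assumptions.

def relative_training_cost(fertility_ratio: float) -> float:
    """Training cost relative to a baseline language, assuming cost grows
    quadratically with the token count needed for the same content."""
    return fertility_ratio ** 2

print(relative_training_cost(1.0))  # baseline language: 1.0
print(relative_training_cost(2.0))  # 2x tokens -> 4.0 (quadrupled cost)
print(relative_training_cost(1.5))  # 1.5x tokens -> 2.25x cost
```

On this reading, a language with 2x fertility pays roughly 4x the training cost for equivalent content, which is the asymmetry the paper names the token tax.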
Innovation

Methods, ideas, or system contributions that make the work stand out.

Morphologically aware tokenization reduces bias
Fertility predicts accuracy in multilingual models
Reasoning models outperform peers across languages
Jessica M. Lundin
Institute for Disease Modeling
Ada Zhang
University of San Francisco
Nihal Karim
University of San Francisco
Hamza Louzan
University of San Francisco
Victor Wei
University of San Francisco
David Adelani
McGill University
Cody Carroll
University of San Francisco
biostatistics, nonparametric statistics, functional data analysis, conservation technology