The Token Tax: Systematic Bias in Multilingual Tokenization

📅 2025-09-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies a systematic bias in multilingual tokenization: morphologically complex and low-resource languages incur disproportionately higher computational costs and suffer lower accuracy due to token inflation. We introduce the *token tax* as a quantitative measure of this inefficiency and empirically demonstrate a strong negative correlation between tokenization ratio (tokens per word) and model accuracy. Using multiple-choice question-answering experiments on AfriMMLU (16 African languages, 5 subjects) with 10 large language models, combined with economic cost modeling, we show that doubling the token count quadruples training cost and time. Key contributions: (1) establishing tokenization efficiency as a central determinant of linguistic fairness; (2) demonstrating that inference-time adaptation (e.g., via reasoning models) significantly narrows performance gaps between high- and low-resource languages; and (3) providing empirical evidence and quantitative tools to guide the design of linguistically equitable tokenizers.

📝 Abstract
Tokenization inefficiency imposes structural disadvantages on morphologically complex, low-resource languages, inflating compute requirements and depressing accuracy. We evaluate 10 large language models (LLMs) on AfriMMLU (9,000 MCQA items; 5 subjects; 16 African languages) and show that fertility (tokens/word) reliably predicts accuracy: higher fertility consistently predicts lower accuracy across all models and subjects. We further find that reasoning models (DeepSeek, o1) consistently outperform non-reasoning peers across high- and low-resource languages in the AfriMMLU dataset, narrowing accuracy gaps observed in prior generations. Finally, translating token inflation into economics, a doubling in tokens results in quadrupled training cost and time, underscoring the token tax faced by many languages. These results motivate morphologically aware tokenization, fair pricing, and multilingual benchmarks for equitable natural language processing (NLP).
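The fertility metric the abstract leans on can be stated in a few lines. The sketch below is illustrative only: the token lists are hand-written stand-ins for real subword tokenizer output, and the whitespace word split is an assumption, not the paper's exact procedure.

```python
# Minimal sketch of the fertility metric (tokens per word).
# Token lists here are hypothetical; a real evaluation would use a
# subword tokenizer (e.g. BPE) and the paper's word-segmentation rules.

def fertility(tokens: list[str], text: str) -> float:
    """Tokens per whitespace-delimited word; higher means more token inflation."""
    words = text.split()
    return len(tokens) / len(words) if words else 0.0

# A language whose words fragment into many subwords pays a higher
# "token tax": more tokens (hence more compute) for the same content.
english_tokens = ["The", "cat", "sat"]                       # 3 tokens, 3 words
agglutinative_tokens = ["ni", "na", "penda", "ku", "soma"]   # 5 tokens, 2 words

print(fertility(english_tokens, "The cat sat"))            # 1.0
print(fertility(agglutinative_tokens, "ninapenda kusoma")) # 2.5
```

Under this measure, the abstract's finding is that per-language accuracy falls as fertility rises.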
Problem

Research questions and friction points this paper is trying to address.

Tokenization inefficiency disadvantages morphologically complex languages
Higher token fertility predicts lower model accuracy
Token inflation quadruples training costs and time
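The cost claim in the last bullet implies training cost grows roughly quadratically in token count (doubling tokens quadruples cost). The function below is a hedged reading of that claim, not the paper's actual cost model; the quadratic form and unit baseline are assumptions for illustration.

```python
# Sketch of the "token tax" cost claim: doubling the token count
# quadruples training cost and time, i.e. cost ~ tokens**2.
# The exponent and baseline are illustrative assumptions.

def relative_training_cost(fertility_ratio: float) -> float:
    """Training cost relative to a baseline language, assuming cost grows
    quadratically with the token count needed for the same content."""
    return fertility_ratio ** 2

print(relative_training_cost(1.0))  # baseline language: 1.0
print(relative_training_cost(2.0))  # 2x tokens -> 4.0 (quadrupled cost)
print(relative_training_cost(1.5))  # 1.5x tokens -> 2.25x cost
```

On this reading, a language with 2x fertility pays roughly 4x the training cost for equivalent content, which is the asymmetry the paper names the token tax.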
Innovation

Methods, ideas, or system contributions that make the work stand out.

Morphologically aware tokenization reduces bias
Fertility predicts accuracy in multilingual models
Reasoning models outperform peers across languages
Jessica M. Lundin
Institute for Disease Modeling
Ada Zhang
University of San Francisco
Nihal Karim
University of San Francisco
Hamza Louzan
University of San Francisco
Victor Wei
University of San Francisco
David Adelani
McGill University
Cody Carroll
University of San Francisco
biostatistics, nonparametric statistics, functional data analysis, conservation technology