🤖 AI Summary
This study investigates the performance degradation of multilingual large language models (LLMs) on dialectal NLP tasks induced by subtle linguistic variations, focusing on tokenization and semantic representation biases. We employ two metrics—Tokenization Parity (TP) and Information Parity (IP)—to quantify tokenization consistency and semantic fidelity, respectively, and systematically evaluate their predictive power across downstream tasks including dialect classification, topic classification, and extractive question answering. Using mainstream decoder-only and encoder-based models, we conduct tokenizer behavior analysis, vocabulary coverage measurement, and qualitative studies under diverse script and resource conditions. Results show that TP better predicts performance on syntactically or morphologically sensitive tasks, whereas IP correlates more strongly with semantically intensive tasks. These findings expose a fundamental mismatch between vendors’ claims of “language support” and the actual representational capacity of multilingual LLMs for low-resource dialectal varieties.
📝 Abstract
Dialectal data are characterized by linguistic variation that appears small to humans but has a significant impact on model performance. This dialect gap has been attributed to various factors (e.g., data size, economic and social factors), but their impact proves inconsistent. In this work, we investigate factors that affect model performance more directly: we correlate Tokenization Parity (TP) and Information Parity (IP), as measures of representational biases in pre-trained multilingual models, with downstream performance. We compare state-of-the-art decoder-only LLMs with encoder-based models across three tasks: dialect classification, topic classification, and extractive question answering, controlling for varying scripts (Latin vs. non-Latin) and resource availability (high vs. low). Our analysis reveals that TP is a better predictor of performance on tasks reliant on syntactic and morphological cues (e.g., extractive QA), while IP better predicts performance on semantic tasks (e.g., topic classification). Complementary analyses, including tokenizer behavior, vocabulary coverage, and qualitative insights, reveal that the language support claims of LLMs may mask deeper mismatches at the script or token level.
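To make the tokenization-side metric concrete, here is a minimal sketch of one plausible way to compute a Tokenization Parity score between a standard variety and a dialect. This is an illustration, not the paper's exact formulation: the function `tokenization_parity`, the toy tokenizer, and the example sentences are all invented for this sketch, and the score is assumed to be an average per-sentence token-count ratio over parallel data (values near 1.0 indicate parity; larger values mean the dialect is over-segmented relative to the standard).

```python
def tokenization_parity(tokenize, standard_sents, dialect_sents):
    """Average per-sentence token-count ratio (dialect / standard).

    Assumes `standard_sents` and `dialect_sents` are parallel lists of
    sentences and `tokenize` maps a string to a list of tokens.
    """
    assert len(standard_sents) == len(dialect_sents)
    ratios = []
    for std, dia in zip(standard_sents, dialect_sents):
        ratios.append(len(tokenize(dia)) / len(tokenize(std)))
    return sum(ratios) / len(ratios)


def toy_tokenize(text):
    """Toy subword-like tokenizer for illustration: splits on whitespace,
    then chunks each word into pieces of at most 4 characters."""
    return [w[i:i + 4] for w in text.split() for i in range(0, len(w), 4)]


if __name__ == "__main__":
    standard = ["the weather is nice today"]
    dialect = ["tha weathah is noice taday"]  # invented dialect spelling
    # The non-standard spellings fragment into more subword pieces,
    # so the ratio exceeds 1.0.
    print(round(tokenization_parity(toy_tokenize, standard, dialect), 2))
```

In practice one would replace `toy_tokenize` with a real multilingual tokenizer (e.g., a pretrained subword tokenizer) and aggregate over a parallel dialect corpus; Information Parity would additionally require model log-likelihoods rather than token counts.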