Multilingual LLMs Struggle to Link Orthography and Semantics in Bilingual Word Processing

📅 2025-01-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study identifies a fundamental deficiency in multilingual large language models (MLLMs): an overreliance on orthographic similarity and an inability to robustly link orthography to semantics—particularly evident in English–Spanish/French/German heteronyms, where performance falls significantly below random chance. Using controlled lexical pairs (cognates, non-cognates, heteronyms) and evaluating both isolated-word discrimination and sentence-level semantic inference, we systematically uncover, for the first time, a “grapheme–semantics decoupling” phenomenon and language-asymmetric processing strategies. Key contributions are: (1) empirical evidence of semantic failure in cross-lingual ambiguity resolution; (2) demonstration that isolated-word discrimination capability does not correlate with genuine semantic understanding; and (3) identification of an English-dominant processing bias. These findings challenge the prevailing assumption of semantic consistency in multilingual representations and reveal critical limitations in MLLMs’ cross-lingual lexical semantics.

Technology Category

Application Category

📝 Abstract
Bilingual lexical processing is shaped by the complex interplay of phonological, orthographic, and semantic features of two languages within an integrated mental lexicon. In humans, this is evident in the ease with which cognate words - words similar in both orthographic form and meaning (e.g., blind, meaning"sightless"in both English and German) - are processed, compared to the challenges posed by interlingual homographs, which share orthographic form but differ in meaning (e.g., gift, meaning"present"in English but"poison"in German). We investigate how multilingual Large Language Models (LLMs) handle such phenomena, focusing on English-Spanish, English-French, and English-German cognates, non-cognate, and interlingual homographs. Specifically, we evaluate their ability to disambiguate meanings and make semantic judgments, both when these word types are presented in isolation or within sentence contexts. Our findings reveal that while certain LLMs demonstrate strong performance in recognizing cognates and non-cognates in isolation, they exhibit significant difficulty in disambiguating interlingual homographs, often performing below random baselines. This suggests LLMs tend to rely heavily on orthographic similarities rather than semantic understanding when interpreting interlingual homographs. Further, we find LLMs exhibit difficulty in retrieving word meanings, with performance in isolative disambiguation tasks having no correlation with semantic understanding. Finally, we study how the LLM processes interlingual homographs in incongruent sentences. We find models to opt for different strategies in understanding English and non-English homographs, highlighting a lack of a unified approach to handling cross-lingual ambiguities.
Problem

Research questions and friction points this paper is trying to address.

Multilingual Models
Homograph Disambiguation
Cross-lingual Semantic Understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual Large Models
Polysemous Word Processing
Cross-lingual Semantic Understanding
🔎 Similar Papers
No similar papers found.
Eshaan Tanwar
Eshaan Tanwar
Københavns Universitet
Natural Language Processing
G
Gayatri Oke
Indian Institute of Technology Delhi, Department of Electrical Engineering, New Delhi, 110016, India
T
Tanmoy Chakraborty
Indian Institute of Technology Delhi, Department of Electrical Engineering, New Delhi, 110016, India