Multilingual Idioms in Sentences and Conversations Across High-, Medium-, and Low-Resource Languages

2026-06-01

AI Summary
This study addresses the challenge of modeling the semantic shift between literal and figurative readings of idioms in multilingual natural language processing, particularly in low-resource languages and in realistic contexts, where current systems perform poorly. The authors introduce MIDI, a multilingual idiom dataset spanning 18 languages across high-, medium-, and low-resource settings. It provides both sentence-level and dialogue-level context and is annotated by native speakers, so that literal and figurative usage can be modeled and evaluated together. Through contextual embedding analysis, hidden-layer intervention studies, and benchmarking of multilingual large language models, the research finds a significant drop in idiom understanding for low-resource languages, and finds that literal meanings are consistently harder to recognize than figurative ones. Dialogue context provides some improvement but does not close the resource gap, which points to fundamental limitations in current models' memorization and reasoning capabilities.
Abstract
Idiomatic expressions pose a major challenge for multilingual NLP because their meanings shift between figurative and literal usage, often requiring context for accurate interpretation. Prior work has focused on high-resource languages and typically evaluates isolated idiom-meaning questions, overlooking realistic discourse. We introduce MIDI, a multilingual idiom dataset spanning 3 high-, 3 medium-, and 12 low-resource languages, curated by native speakers. Unlike previous datasets, MIDI provides idioms embedded in both sentence-level and conversational contexts, capturing both literal and figurative readings. Benchmarking state-of-the-art models shows that idiom comprehension degrades in low-resource languages and that, in all resource tiers, literal interpretations are substantially harder than figurative ones. Conversational context improves performance but does not eliminate these disparities. Through controlled tests and interventions on hidden representations, we further separate memorization from reasoning, exposing core limitations of current models.
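The comparison described in the abstract (resource tier, context type, and literal vs. figurative reading) can be illustrated with a minimal scoring sketch. Everything here is an assumption made for illustration: the `Example` record, its field names, and the `accuracy_by_group` helper are hypothetical and not taken from the MIDI release, and the toy records are synthetic placeholders, not results from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """Hypothetical record for one evaluated idiom instance."""
    tier: str       # "high", "medium", or "low" resource tier
    context: str    # "sentence" or "conversation"
    gold: str       # annotated reading: "literal" or "figurative"
    predicted: str  # model's predicted reading


def accuracy_by_group(examples):
    """Return accuracy keyed by (tier, context, gold reading)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for ex in examples:
        key = (ex.tier, ex.context, ex.gold)
        totals[key] += 1
        hits[key] += int(ex.predicted == ex.gold)
    return {key: hits[key] / totals[key] for key in totals}


# Synthetic placeholder records, only to show the call shape.
toy = [
    Example("high", "sentence", "figurative", "figurative"),
    Example("high", "sentence", "literal", "figurative"),
    Example("low", "conversation", "literal", "literal"),
    Example("low", "conversation", "literal", "figurative"),
]
scores = accuracy_by_group(toy)
for key in sorted(scores):
    print(key, scores[key])
```

Grouping by the gold reading rather than reporting a single overall score is what would make a gap between literal and figurative accuracy visible within each tier.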
Problem

Research questions and friction points this paper is trying to address.

multilingual idioms
figurative vs. literal meaning
low-resource languages
contextual understanding
NLP benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

multilingual idiom dataset
contextual idiom understanding
low-resource languages
figurative vs. literal interpretation
model reasoning vs. memorization
Similar Papers
Saeed Almheiri
Mohamed bin Zayed University of Artificial Intelligence
Bilal Elbouardi
Mohamed bin Zayed University of Artificial Intelligence
Salsabila Zahirah Pranida
Mohamed bin Zayed University of Artificial Intelligence
Irina Nikishina
Postdoc @ University of Hamburg
Natural Language Processing, RAG, Taxonomies, Question Answering
Ashwath Rao B
Manipal University
Parameswari Krishnamurthy
IIIT Hyderabad
Muhammad Cendekia Airlangga
Mohamed bin Zayed University of Artificial Intelligence
Rifo Ahmad Genadi
Mohamed bin Zayed University of Artificial Intelligence
Nguyen Phan Gia Bao
University of Science and Technology of Hanoi
Amir Hossein Yari
Mohamed bin Zayed University of Artificial Intelligence
Hawau Olamide Toyin
PhD student at MBZUAI
Speech Synthesis and Recognition, Stuttering Speech, NLP, ML
Nurdaulet Mukhituly
PhD in NLP, MBZUAI
Natural Language Processing, Mechanistic Interpretability, AI Safety
Mena Attia
Mohamed bin Zayed University of Artificial Intelligence
Besher Hassan
Mohamed bin Zayed University of Artificial Intelligence
Ahmad Fathan Hidayatullah
Universitas Islam Indonesia
Tatsuki Kuribayashi
MBZUAI
Natural Language Processing, Computational Psycholinguistics
Haonan Li
Mohamed bin Zayed University of Artificial Intelligence
Suma Bhat
University of Illinois at Urbana-Champaign
Natural Language Processing, Educational Applications of AI
Fajri Koto
Assistant Professor (tenure-track), MBZUAI
Computational Linguistics, Natural Language Processing, Multilingual NLP, Human-centered NLP