AI Summary
Current large language models often struggle with culturally and metaphorically dense idioms, frequently misinterpreting them through literal semantics while overlooking their intended meanings. To address this limitation, this work introduces Mediom, the first high-quality multimodal idiom corpus covering Hindi, Bengali, and Thai, alongside HIDE, a novel prompt-based framework for idiom interpretation. HIDE integrates multilingual large language models with vision-language models and employs iterative refinement through error-feedback retrieval and diagnostic prompting to enhance non-literal reasoning. Experimental results demonstrate that HIDE substantially mitigates systematic deficiencies in cross-cultural idiom comprehension exhibited by existing models when evaluated on the Mediom benchmark.
Abstract
Idiomatic reasoning, deeply intertwined with metaphor and culture, remains a blind spot for contemporary language models, whose progress skews toward surface-level lexical and semantic cues. For instance, the Bengali idiom \textit{\foreignlanguage{bengali}{\char"0986\char"0999\char"09CD\char"0997\char"09C1 \char"09B0 \char"09AB\char"09B2 \char"099F\char"0995}} (angur fol tok, ``grapes are sour'') encodes denial-driven rationalization, yet naive models latch onto the literal fox-and-grape imagery. To address this oversight, we present ``Mediom,'' a multilingual, multimodal idiom corpus of 3,533 Hindi, Bengali, and Thai idioms, each paired with gold-standard explanations, cross-lingual translations, and carefully aligned text--image representations. We benchmark both large language models (textual reasoning) and vision-language models (figurative disambiguation) on Mediom, exposing systematic failures in metaphor comprehension. To mitigate these gaps, we propose ``HIDE,'' a Hinting-based Idiom Explanation framework that leverages error-feedback retrieval and targeted diagnostic cues for iterative reasoning refinement. Together, Mediom and HIDE establish a rigorous test bed and a methodology for culturally grounded, multimodal idiom understanding guided by reasoning hints in next-generation AI systems.