🤖 AI Summary
This work addresses the core challenge of idiom comprehension, particularly non-literal figurative meaning, in multimodal, multilingual settings. To this end, it introduces the first cross-lingual, cross-modal idiom understanding benchmark, comprising two subtasks: ranking images by their alignment with an idiom's figurative or literal meaning, and predicting the next image in a sequence. The strongest participating systems mitigated large language models' semantic biases around idiomatic expressions by combining vision-language models (VLMs) and large language models (LLMs) through multi-query prompting, expert-weighted (mixture-of-experts) ensembling, and semantic smoothing. These methods reached human-level performance on both subtasks, with clear gains in multilingual idiom-image alignment accuracy and sequence consistency. The benchmark establishes a new testbed for idiom representation learning in multimodal, multilingual contexts.
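The multi-query, expert-weighted fusion idea can be sketched in a few lines. The snippet below is a hypothetical illustration, not the systems' actual implementation: `vlm_score` and `llm_caption_score` are stand-in scorers for real VLM/LLM experts, and the weights and query count are assumed values. Each expert scores every candidate image against the idiom's intended meaning several times (simulating paraphrased prompts), the per-expert scores are averaged to smooth prompt-sensitive noise, and a weighted sum over experts produces the final ranking.

```python
from statistics import mean

def fuse_rankings(idiom, images, experts, n_queries=3):
    """Rank images by expert-weighted, multi-query average match scores.

    experts: list of (weight, score_fn) pairs, where score_fn(idiom, image,
    query_id) returns a match score in [0, 1] for one prompt variant.
    """
    fused = {}
    for image in images:
        total = 0.0
        for weight, score_fn in experts:
            # Multi-query smoothing: average over several prompt variants.
            avg = mean(score_fn(idiom, image, q) for q in range(n_queries))
            total += weight * avg
        fused[image] = total
    # Higher fused score = better alignment with the intended meaning.
    return sorted(images, key=fused.get, reverse=True)

# Toy deterministic experts (stand-ins for real VLM/LLM scorers).
def vlm_score(idiom, image, q):
    return {"img_figurative": 0.9, "img_literal": 0.4, "img_unrelated": 0.1}[image]

def llm_caption_score(idiom, image, q):
    return {"img_figurative": 0.7, "img_literal": 0.6, "img_unrelated": 0.2}[image]

ranking = fuse_rankings(
    "kick the bucket",
    ["img_literal", "img_unrelated", "img_figurative"],
    experts=[(0.6, vlm_score), (0.4, llm_caption_score)],
)
print(ranking)  # → ['img_figurative', 'img_literal', 'img_unrelated']
```

In a real system the expert weights could themselves be tuned on a development set, and the score functions would wrap actual model calls (e.g. an image-text similarity model and an LLM judging generated captions).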
📝 Abstract
Idiomatic expressions present a unique challenge in NLP, as their meanings are often not directly inferable from their constituent words. Despite recent advancements in Large Language Models (LLMs), idiomaticity remains a significant obstacle to robust semantic representation. We present datasets and tasks for SemEval-2025 Task 1: AdMiRe (Advancing Multimodal Idiomaticity Representation), which challenges the community to assess and improve models' ability to interpret idiomatic expressions in multimodal contexts and in multiple languages. Participants competed in two subtasks: ranking images based on their alignment with idiomatic or literal meanings, and predicting the next image in a sequence. The most effective methods achieved human-level performance by leveraging pretrained LLMs and vision-language models in mixture-of-experts settings, with multiple queries used to smooth over the weaknesses in these models' representations of idiomaticity.