🤖 AI Summary
This study addresses the challenge of preserving figurative and cultural meaning when culturally grounded, metaphorical idioms from low-resource South and Southeast Asian languages (such as Hindi, Bengali, and Thai) are modeled computationally and transferred across languages. The authors construct Varnika, a multimodal idiom corpus of 3,533 multilingual idioms annotated with seven idiomatic tones and aligned with textual and visual representations. They propose HybridMoE, a Hybrid Mixture-of-Experts framework that mitigates expert sparsity by fusing outputs from both selected and unselected experts, and that is further augmented with Idiomatic Property Signals derived from masked multimodal embeddings. They also introduce IDIO-TONE and the Idiomatic Validation Score, a three-stage evaluation pipeline that measures literal translation fidelity, visual-semantic alignment, and idiomatic meaning retention. Experiments report 5–6% performance gains for HybridMoE across advanced vision-language models, indicating improved representation of figurative language in multilingual multimodal settings.
📝 Abstract
In multilingual education, learning idioms offers a gateway to the creativity, cultural values, historical context, and diverse perspectives of different linguistic traditions. This paper addresses the problem of retaining figurative and cultural semantics in low-resource South and Southeast Asian languages such as Hindi, Bengali, and Thai, where culturally rich idioms pose significant obstacles for computational modeling and cross-linguistic transfer because of their deep metaphorical complexity. To tackle this complexity, we present Varnika, a reconstructed multimodal idiom corpus comprising 3,533 multilingual idioms, enriched with seven idiomatic tones aligned with both textual and visual representations. To support idiomatic understanding, we introduce a Hybrid Mixture-of-Experts (HybridMoE) framework that combines the opinions of multiple idiomatic experts while mitigating expert sparsity: it integrates outputs from both selected and unselected experts through controlled hybridization and is further augmented with Idiomatic Property Signals via masked multimodal embeddings. To analyze performance across multiple dimensions, we propose IDIO-TONE and the Idiomatic Validation Score, a three-stage evaluation pipeline measuring (i) literal translation fidelity, (ii) visual-semantic alignment, and (iii) idiomatic meaning retention. Empirical evaluations show that HybridMoE achieves 5–6% performance gains across advanced vision-language models, demonstrating improved representation of figurative language and culturally embedded meaning in multilingual multimodal settings.
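The abstract does not specify how outputs from selected and unselected experts are combined, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes top-k gating over a softmax of gate logits, a renormalized weighted sum over the selected experts, and a uniform average over the unselected experts blended in with a mixing coefficient. The names `hybrid_moe`, `top_k`, and `alpha` are hypothetical and do not come from the paper.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
Expert = Callable[[Sequence[float]], Vector]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of gate logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hybrid_moe(
    x: Sequence[float],
    experts: Sequence[Expert],
    gate_logits: Sequence[float],
    top_k: int = 2,
    alpha: float = 0.1,
) -> Vector:
    """Blend a sparse top-k expert mixture with the unselected experts.

    Assumption (not from the paper): the output is
        (1 - alpha) * sparse_mixture + alpha * mean(unselected outputs),
    where alpha plays the role of the "controlled hybridization" weight.
    """
    if len(gate_logits) != len(experts):
        raise ValueError("need exactly one gate logit per expert")
    if not 0 < top_k <= len(experts):
        raise ValueError("top_k must be between 1 and the number of experts")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")

    probs = softmax(gate_logits)
    ranked = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)
    selected, unselected = ranked[:top_k], ranked[top_k:]

    outputs = [expert(x) for expert in experts]
    dim = len(outputs[0])

    # Sparse part: gate weights renormalized over the selected experts only.
    selected_mass = sum(probs[i] for i in selected)
    sparse = [
        sum(probs[i] / selected_mass * outputs[i][d] for i in selected)
        for d in range(dim)
    ]
    if not unselected:
        return sparse

    # Residual part: uniform average of the experts the gate did not pick.
    residual = [
        sum(outputs[j][d] for j in unselected) / len(unselected)
        for d in range(dim)
    ]
    return [(1.0 - alpha) * s + alpha * r for s, r in zip(sparse, residual)]
```

With `alpha = 0` the sketch reduces to ordinary sparse top-k routing. Larger values let the unselected experts contribute more to the output, which is one way to read the abstract's claim about mitigating expert sparsity.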