Visual Word Sense Disambiguation with CLIP through Dual-Channel Text Prompting and Image Augmentations

📅 2026-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes an interpretable visual word sense disambiguation (VWSD) framework to address lexical ambiguity in natural language. Leveraging the CLIP model, the approach maps ambiguous text and candidate images into a shared multimodal embedding space. It introduces a dual-channel textual prompt that combines semantic descriptions with photorealistic cues, and integrates WordNet synonyms to enhance semantic precision. A test-time image augmentation strategy further refines the image embeddings. Disambiguation is performed by selecting the candidate image with the highest cosine similarity to the enriched textual embedding. Evaluated on the SemEval-2023 VWSD dataset, the method raises the mean reciprocal rank (MRR) from 0.7227 to 0.7590 (+3.63 points) and the hit rate from 0.5810 to 0.6220 (+4.10 points) over the CLIP baseline, demonstrating that precise, CLIP-aligned prompting delivers effective, low-latency disambiguation.
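For readers who want a concrete picture of the pipeline, the following is a minimal sketch of the dual-channel prompt ensemble and cosine-similarity ranking, assuming a Hugging Face CLIP checkpoint (`openai/clip-vit-large-patch14`) and NLTK's WordNet. The prompt templates and helper names (`build_prompts`, `rank_images`) are illustrative assumptions, not the paper's exact implementation.

```python
# Illustrative sketch only: the prompt templates and helper names below are
# assumptions, not the paper's exact implementation.
import torch
from PIL import Image
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def build_prompts(target_word: str, context: str) -> list[str]:
    """Dual-channel prompts: a semantic description and a photorealistic cue,
    expanded with WordNet synonyms of the ambiguous target word."""
    synonyms = {lemma.name().replace("_", " ")
                for synset in wn.synsets(target_word)
                for lemma in synset.lemmas()}
    synonyms.add(target_word)
    prompts = []
    for word in sorted(synonyms):
        prompts.append(f"{context}, meaning {word}")  # semantic channel
        prompts.append(f"a photo of {word}")          # photorealistic channel
    return prompts

@torch.no_grad()
def rank_images(target_word: str, context: str,
                images: list[Image.Image]) -> list[int]:
    """Return candidate-image indices sorted by cosine similarity to the
    averaged dual-channel text embedding (best match first)."""
    text_in = processor(text=build_prompts(target_word, context),
                        return_tensors="pt", padding=True, truncation=True)
    img_in = processor(images=images, return_tensors="pt")

    txt = model.get_text_features(**text_in)
    txt = txt / txt.norm(dim=-1, keepdim=True)  # normalise each prompt
    txt = txt.mean(dim=0)                       # ensemble over the prompt set
    txt = txt / txt.norm()

    img = model.get_image_features(**img_in)
    img = img / img.norm(dim=-1, keepdim=True)

    scores = img @ txt                          # cosine similarity per image
    return scores.argsort(descending=True).tolist()
```

In the SemEval-2023 VWSD setting, `target_word` and `context` would come from each test instance and `images` would hold that instance's candidate set; the top-ranked index is the predicted sense-matching image.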

📝 Abstract
Ambiguity poses persistent challenges in natural language understanding for large language models (LLMs). To better understand how lexical ambiguity can be resolved through the visual domain, we develop an interpretable Visual Word Sense Disambiguation (VWSD) framework. The model leverages CLIP to project ambiguous language and candidate images into a shared multimodal space. We enrich textual embeddings using a dual-channel ensemble of semantic and photo-based prompts with WordNet synonyms, while image embeddings are refined through robust test-time augmentations. We then use cosine similarity to determine the image that best aligns with the ambiguous text. When evaluated on the SemEval-2023 VWSD dataset, enriching the embeddings raises the MRR from 0.7227 to 0.7590 and the Hit Rate from 0.5810 to 0.6220. Ablation studies reveal that dual-channel prompting provides strong, low-latency performance, whereas aggressive image augmentation yields only marginal gains. Additional experiments with WordNet definitions and multilingual prompt ensembles further suggest that noisy external signals tend to dilute semantic specificity, reinforcing the effectiveness of precise, CLIP-aligned prompts for visual word sense disambiguation.
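The test-time augmentation step described in the abstract can be sketched as follows: embed several mildly perturbed views of each candidate image and average the normalised CLIP embeddings. The specific views chosen below are assumptions for illustration; since the paper's ablations report that aggressive augmentation yields only marginal gains, only mild transforms are shown.

```python
# Hedged sketch of test-time image augmentation: this set of views is an
# assumption for illustration, not the paper's exact augmentation recipe.
import torch
from PIL import Image
from torchvision import transforms
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

VIEWS = [
    transforms.Lambda(lambda im: im),                      # identity view
    transforms.RandomHorizontalFlip(p=1.0),                # mirrored view
    transforms.CenterCrop(224),                            # tighter central crop
    transforms.ColorJitter(brightness=0.2, contrast=0.2),  # mild photometric shift
]

@torch.no_grad()
def augmented_image_embedding(image: Image.Image) -> torch.Tensor:
    """Average the normalised CLIP embeddings of several views of one image."""
    batch = processor(images=[view(image) for view in VIEWS],
                      return_tensors="pt")
    emb = model.get_image_features(**batch)
    emb = emb / emb.norm(dim=-1, keepdim=True)  # normalise each view
    mean = emb.mean(dim=0)                      # ensemble over views
    return mean / mean.norm()                   # renormalised image embedding
```

These refined image embeddings would then stand in for the per-image embeddings in the cosine-similarity ranking step sketched earlier.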
Problem

Research questions and friction points this paper is trying to address.

Visual Word Sense Disambiguation
Lexical Ambiguity
Multimodal Representation
CLIP
Text-Image Alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Word Sense Disambiguation
CLIP
Dual-Channel Prompting
Image Augmentation
Multimodal Embedding
Shamik Bhattacharya
PhD Student
Climate Change · Earth System Models (ESMs) · Environmental Science
Daniel Perkins
The Bredesen Center for Interdisciplinary Research and Graduate Education, Knoxville, TN 37916
Yaren Dogan
Department of Electrical Engineering & Computer Science, University of Tennessee, Knoxville, TN 37916
Vineeth Konjeti
Department of Electrical Engineering & Computer Science, University of Tennessee, Knoxville, TN 37916
Sudarshan Srinivasan
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN 37830
Edmon Begoli
Oak Ridge National Laboratory (ORNL)
AI Security · Text Analysis