🤖 AI Summary
This study addresses the prediction of human-rated plausibility scores (on a 1–5 scale) for different senses of polysemous words within specific contexts in short narratives. The authors systematically compare sentence embedding–based regression, parameter-efficient fine-tuning (e.g., LoRA) of Transformer models, and large language models (LLMs) guided by structured prompts and explicit decision rules. Notably, the best system decomposes each narrative into three segments—precontext, target sentence, and ending—and applies explicit decision rules to calibrate plausibility ratings. Experimental results demonstrate that structured prompting with decision rules significantly outperforms both fine-tuned and embedding-based methods. Moreover, prompt design exerts a greater influence on performance than model scale, underscoring the critical role of prompt engineering in evaluating word sense plausibility.
📝 Abstract
Word sense plausibility rating requires predicting the human-perceived plausibility of a given word sense on a 1–5 scale in the context of short narrative stories containing ambiguous homonyms. This paper systematically compares three approaches: (1) embedding-based methods pairing sentence embeddings with standard regressors, (2) transformer fine-tuning with parameter-efficient adaptation, and (3) large language model (LLM) prompting with structured reasoning and explicit decision rules. The best-performing system employs a structured prompting strategy that decomposes evaluation into narrative components (precontext, target sentence, ending) and applies explicit decision rules for rating calibration. The analysis reveals that structured prompting with decision rules substantially outperforms both fine-tuned models and embedding-based approaches, and that prompt design matters more than model scale for this task. The code is publicly available at https://github.com/tongwu17/SemEval-2026-Task5.
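The structured prompting strategy can be sketched as follows. This is a minimal illustration, not the authors' actual prompt: the function name, rule wordings, and example story are all hypothetical, assuming only the decomposition into precontext, target sentence, and ending plus explicit 1–5 decision rules described above.

```python
# Hypothetical sketch of a structured plausibility prompt. The narrative is
# split into precontext, target sentence, and ending, and explicit decision
# rules guide the 1-5 rating. Rule wordings are illustrative assumptions.

DECISION_RULES = [
    "Rate 5 if the sense fits the target sentence and is reinforced by the ending.",
    "Rate 3-4 if the sense fits the target sentence but the ending is neutral.",
    "Rate 1-2 if the ending contradicts the sense or makes it implausible.",
]

def build_plausibility_prompt(precontext: str, target: str, ending: str,
                              homonym: str, sense_gloss: str) -> str:
    """Assemble a structured prompt for rating one candidate word sense."""
    rules = "\n".join(f"- {r}" for r in DECISION_RULES)
    return (
        "You rate how plausible a word sense is in a short narrative, "
        "on a 1-5 scale.\n\n"
        f"Precontext: {precontext}\n"
        f"Target sentence: {target}\n"
        f"Ending: {ending}\n\n"
        f"Ambiguous word: {homonym}\n"
        f"Candidate sense: {sense_gloss}\n\n"
        "Decision rules:\n"
        f"{rules}\n\n"
        "Answer with a single integer from 1 to 5."
    )

# Example usage with a made-up story about the homonym "bank":
prompt = build_plausibility_prompt(
    precontext="Sam walked along the river all morning.",
    target="At noon he finally reached the bank.",
    ending="He sat down on the grassy slope by the water.",
    homonym="bank",
    sense_gloss="the land alongside a river",
)
print(prompt)
```

The resulting string would be sent to an LLM; the decision rules serve as the calibration step that, per the paper's findings, matters more than the choice of model.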