A Comprehensive Evaluation of Semantic Relation Knowledge of Pretrained Language Models and Humans

📅 2024-12-02
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluations of PLMs' semantic relation knowledge are largely limited to hypernymy and do not compare models with humans on the same task. Method: The paper introduces a unified benchmark covering hypernymy and five further relations (hyponymy, holonymy, meronymy, antonymy, and synonymy) and uses six metrics, two newly introduced, to assess previously untreated aspects: soundness, completeness, symmetry, asymmetry, prototypicality, and distinguishability. Using prompt-based zero-shot inference, it uniformly evaluates 16 PLMs, both masked and causal (autoregressive). Contribution/Results: Controlled human-machine experiments reveal that models frequently misclassify non-antonymy relations as antonymy; masked models significantly outperform causal ones; and only on antonymy do models approach human performance, demonstrating a substantial semantic knowledge gap between humans and current PLMs.

📝 Abstract
Recently, much work has concerned itself with the enigma of what exactly PLMs (pretrained language models) learn about different aspects of language, and how they learn it. One stream of this research investigates the knowledge that PLMs have about semantic relations. However, many aspects of semantic relations have been left unexplored: only one relation, hypernymy, was considered, and previous work did not measure humans' performance on the same task as that solved by the PLMs. This means that at present we have only an incomplete view of models' semantic relation knowledge. To address this gap, we introduce a comprehensive evaluation framework covering five relations beyond hypernymy, namely hyponymy, holonymy, meronymy, antonymy, and synonymy. We use six metrics (two newly introduced here) for previously untreated aspects of semantic relation knowledge, namely soundness, completeness, symmetry, asymmetry, prototypicality, and distinguishability, and fairly compare humans and models on the same task. Our extensive experiments involve 16 PLMs: eight masked and eight causal language models. Up to now, only masked language models had been tested, although causal and masked language models treat context differently. Our results reveal a significant knowledge gap between humans and models for almost all semantic relations. Antonymy is the outlier relation, on which all models perform reasonably well. In general, masked language models perform significantly better than causal language models. Nonetheless, both masked and causal language models are likely to confuse non-antonymy relations with antonymy.
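To make the probing setup concrete, below is a minimal Python sketch (using Hugging Face Transformers) of zero-shot, cloze-style scoring with a masked language model, in the spirit of the prompt-based evaluation described above. The model choice, prompt template, and word pairs are illustrative assumptions, not the authors' exact protocol.

```python
# Minimal sketch: score candidate fillers for a relation-probing cloze prompt
# with a masked LM. Template and example pairs are illustrative assumptions.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def masked_candidate_score(template: str, candidate: str) -> float:
    """Log-probability the masked LM assigns to `candidate` at the mask slot."""
    prompt = template.format(mask=tokenizer.mask_token)
    inputs = tokenizer(prompt, return_tensors="pt")
    # Position of the [MASK] token in the encoded prompt.
    mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    cand_ids = tokenizer(candidate, add_special_tokens=False).input_ids
    if len(cand_ids) != 1:
        raise ValueError("this sketch handles single-token candidates only")
    with torch.no_grad():
        logits = model(**inputs).logits
    log_probs = torch.log_softmax(logits[0, mask_pos[0]], dim=-1)
    return log_probs[cand_ids[0]].item()

# Hypernymy-style cloze prompt; "dog"/"animal" is a stock example pair.
template = "A dog is a kind of {mask}."
for cand in ["animal", "plant", "cat"]:
    print(f"{cand}: {masked_candidate_score(template, cand):.2f}")
```

A causal model could be scored analogously by computing the log-probability of the candidate as a continuation (e.g., with AutoModelForCausalLM), which is one way masked and causal PLMs can be compared under a single interface.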
Problem

Research questions and friction points this paper is trying to address.

Evaluates PLMs' knowledge of five semantic relations beyond hypernymy.
Compares human and model performance on identical semantic relation tasks.
Assesses soundness, completeness, and other previously untreated aspects of PLMs' semantic relation knowledge.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensive evaluation framework covering five semantic relations beyond hypernymy
Six metrics (two newly introduced) assessing previously untreated aspects of relation knowledge (a symmetry-style probe is sketched after this list)
Uniform comparison of 16 PLMs: eight masked and eight causal language models
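As a rough illustration of what a symmetry-style probe could look like, the following sketch reuses masked_candidate_score() from the earlier example and compares cloze scores in both directions of a word pair. The template and the gap measure are assumptions for illustration, not the paper's metric definitions.

```python
# Hypothetical symmetry probe reusing masked_candidate_score() from the
# earlier sketch. For a symmetric relation such as synonymy, the forward
# and reverse prompts should receive similar scores; this gap measure is
# an illustrative assumption, not the paper's metric.
def symmetry_gap(a: str, b: str) -> float:
    fwd = masked_candidate_score(f"{a} means the same as {{mask}}.", b)
    rev = masked_candidate_score(f"{b} means the same as {{mask}}.", a)
    return abs(fwd - rev)

print(symmetry_gap("happy", "glad"))  # synonymy: expect a small gap
```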
👥 Authors
Zhihan Cao (School of Computing, Institute of Science Tokyo)
Hiroaki Yamada (School of Computing, Institute of Science Tokyo)
Simone Teufel (Professor of Computer Science, Cambridge University; computational linguistics)
Takenobu Tokunaga (School of Computing, Institute of Science Tokyo)