🤖 AI Summary
This study investigates the efficacy and limitations of large language models (LLMs) in supporting FrameNet-style semantic annotation for linguistic resource construction. We conduct systematic experiments comparing three annotation paradigms—manual, fully automatic (LLM-only), and semi-automatic (LLM-generated initial annotations followed by human verification)—along three dimensions: annotation efficiency, semantic frame coverage, and frame diversity. To our knowledge, this is the first evaluation of LLM-assisted semantic role labeling within a perspectivist NLP framework. Results show that the semi-automatic approach achieves coverage statistically equivalent to manual annotation (p > 0.95), increases frame diversity by 18.3%, and reduces annotation time by 42%. In contrast, the fully automatic method, while fastest, incurs a 37% accuracy drop, with errors concentrated in metaphorical and peripheral frames. These findings underscore the irreplaceable role of human–AI collaboration in high-quality semantic resource development and provide a reproducible methodology and empirical benchmark for LLM-augmented language engineering.
📝 Abstract
The use of LLM-based applications as a means to accelerate and/or substitute human labor in the creation of language resources and datasets is a reality. Nonetheless, despite the potential of such tools for linguistic research, a comprehensive evaluation of their performance and of their impact on the creation of annotated datasets, especially under a perspectivized approach to NLP, is still missing. This paper contributes to reducing this gap by reporting on an extensive evaluation of the (semi-)automation of FrameNet-like semantic annotation using an LLM-based semantic role labeler. The methodology compares annotation time, coverage, and diversity across three experimental settings: manual, automatic, and semi-automatic annotation. Results show that the hybrid, semi-automatic setting leads to increased frame diversity and comparable annotation coverage relative to the human-only setting, while the fully automatic setting performs considerably worse on all metrics except annotation time.
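The paper evaluates the settings on coverage and frame diversity but does not spell out the metric definitions here. Below is a minimal, illustrative sketch (not the authors' implementation) assuming simple set-based definitions: coverage as the fraction of candidate targets that received a frame annotation, and diversity as the number of distinct frames evoked. The `Annotation` class, the example targets, and the frame choices are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One frame annotation: the target word and the frame it evokes."""
    target: str
    frame: str


def coverage(annotations: list[Annotation], targets: list[str]) -> float:
    """Fraction of candidate targets that received at least one annotation."""
    annotated = {a.target for a in annotations}
    return len(annotated & set(targets)) / len(targets) if targets else 0.0


def frame_diversity(annotations: list[Annotation]) -> int:
    """Number of distinct frames evoked across the annotation set."""
    return len({a.frame for a in annotations})


# Hypothetical comparison of two annotation settings on the same targets.
targets = ["bought", "house", "market"]
manual = [Annotation("bought", "Commerce_buy"),
          Annotation("house", "Buildings")]
semi_auto = [Annotation("bought", "Commerce_buy"),
             Annotation("house", "Buildings"),
             Annotation("market", "Businesses")]

print(coverage(manual, targets), frame_diversity(manual))        # 0.666..., 2
print(coverage(semi_auto, targets), frame_diversity(semi_auto))  # 1.0, 3
```

Under these assumed definitions, the comparison across settings reduces to computing both scores over each setting's annotation set; annotation time would be logged separately during the experiment.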