🤖 AI Summary
This work systematically evaluates the true zero-shot dependency parsing capability of open-source large language models (LLMs). Addressing the lack of strong, input-agnostic baselines in existing evaluations, we introduce baselines that never see the input sentence, random projective trees and optimal linear arrangements, which have not previously been used in this context. Experiments span multilingual benchmarks and use the standard labeled and unlabeled attachment scores (LAS/UAS) under a unified zero-shot prompting protocol. The vast majority of open-source LLMs fail to surpass these baselines; only the newest and largest LLaMA variant achieves marginal gains across most languages, and even its absolute accuracy remains far below practical utility. The study demonstrates that current open-source LLMs lack reliable zero-shot syntactic parsing ability, underscoring the importance of strong, input-agnostic baselines for realistic and trustworthy model assessment.
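For reference, the LAS/UAS metrics mentioned above can be sketched as follows. This is a minimal illustration of the standard definitions, not the paper's evaluation code; the toy sentence and labels are made up for the example.

```python
# Sketch of unlabeled/labeled attachment score (UAS/LAS):
# UAS = fraction of tokens whose predicted head is correct;
# LAS = fraction whose head AND dependency label are both correct.

def attachment_scores(gold, pred):
    """gold/pred: one (head, label) pair per token; head 0 = root."""
    assert len(gold) == len(pred) and len(gold) > 0
    n = len(gold)
    correct_heads = sum(1 for (gh, _), (ph, _) in zip(gold, pred) if gh == ph)
    correct_both = sum(1 for g, p in zip(gold, pred) if g == p)
    return correct_heads / n, correct_both / n  # (UAS, LAS)

# Toy example, "She saw stars": token 2 ("saw") is the root.
gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod")]  # head right, label wrong
uas, las = attachment_scores(gold, pred)
print(uas, las)  # all heads correct, one label wrong: UAS 1.0, LAS 2/3
```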
📝 Abstract
While LLMs excel in zero-shot tasks, their performance in linguistic challenges like syntactic parsing has been less scrutinized. This paper studies state-of-the-art open-weight LLMs on the task by comparing them to baselines that do not have access to the input sentence, including baselines that have not been used in this context such as random projective trees or optimal linear arrangements. The results show that most of the tested LLMs cannot outperform the best uninformed baselines, with only the newest and largest versions of LLaMA doing so for most languages, and still achieving rather low performance. Thus, accurate zero-shot syntactic parsing is not forthcoming with open LLMs.
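The random projective tree baseline can be sampled without ever reading the sentence, only its length. A minimal sketch under assumptions: this simple recursive sampler produces projective trees but is not claimed to match the paper's exact procedure, nor to be uniform over all projective trees.

```python
import random

def random_projective_heads(n, rng=None):
    """Assign a head to each of n tokens (1-indexed; 0 = root) such that
    the resulting tree is projective: every subtree built here covers a
    contiguous span, so no arcs can cross."""
    rng = rng or random.Random()
    heads = [0] * (n + 1)  # heads[i] = head of token i; slot 0 unused

    def build(lo, hi, parent):
        if lo > hi:
            return
        r = rng.randint(lo, hi)   # span root, attached to parent
        heads[r] = parent
        build(lo, r - 1, r)       # left part becomes a subtree of r
        build(r + 1, hi, r)       # right part becomes a subtree of r

    build(1, n, 0)
    return heads[1:]

# Input-agnostic: the baseline only needs the sentence length.
print(random_projective_heads(5, random.Random(42)))
```

Scoring such trees with LAS/UAS gives an "uninformed" floor that any genuinely parsing model must beat.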