Evaluating Prompt-Based and Fine-Tuned Approaches to Czech Anaphora Resolution

📅 2025-06-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses anaphora (coreference) resolution in Czech, a morphologically rich language for which the task remains low-resource. It systematically compares two paradigms: prompt engineering with black-box large language models (LLMs) and supervised fine-tuning of compact generative models. Experiments on data derived from the Prague Dependency Treebank evaluate Mistral Large 2 and Llama 3 via a unified instruction template, against mT5 and Mistral variants fine-tuned under a consistent protocol. The fine-tuned mT5-large model achieves 88.0% accuracy, substantially outperforming the best prompting approach (74.5%), while exhibiting lower inference latency and reduced computational overhead. To the authors' knowledge, this is the first empirical demonstration that lightweight encoder-decoder fine-tuning beats LLM prompting on both accuracy and efficiency for Czech coreference resolution. The work establishes a reproducible, cost-effective methodology for coreference resolution and related NLU tasks in morphologically complex, low-resource languages.

📝 Abstract
Anaphora resolution plays a critical role in natural language understanding, especially in morphologically rich languages like Czech. This paper presents a comparative evaluation of two modern approaches to anaphora resolution on Czech text: prompt engineering with large language models (LLMs) and fine-tuning compact generative models. Using a dataset derived from the Prague Dependency Treebank, we evaluate several instruction-tuned LLMs, including Mistral Large 2 and Llama 3, using a series of prompt templates. We compare them against fine-tuned variants of the mT5 and Mistral models that we trained specifically for Czech anaphora resolution. Our experiments demonstrate that while prompting yields promising few-shot results (up to 74.5% accuracy), the fine-tuned models, particularly mT5-large, outperform them significantly, achieving up to 88% accuracy while requiring fewer computational resources. We analyze performance across different anaphora types, antecedent distances, and source corpora, highlighting key strengths and trade-offs of each approach.
Problem

Research questions and friction points this paper is trying to address.

Compare prompt-based and fine-tuned models for Czech anaphora resolution
Evaluate performance of LLMs and compact models on Czech text
Analyze accuracy and resource efficiency across different anaphora types
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prompt engineering with large language models
Fine-tuning compact generative models
Comparative evaluation on Czech anaphora resolution
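The prompt-engineering side of the comparison can be illustrated with a minimal sketch. The template wording, the few-shot example, and the helper `build_prompt` below are hypothetical illustrations, not the paper's actual instruction template:

```python
# Hypothetical sketch of an instruction prompt for Czech anaphora resolution.
# The Czech template text and the example sentence are illustrative only.

def build_prompt(text: str, pronoun: str, candidates: list[str]) -> str:
    """Format a zero-shot instruction prompt asking an LLM to pick
    the antecedent of a pronoun from a list of candidate mentions."""
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    return (
        "Úkol: Urči antecedent zájmena v českém textu.\n"  # task instruction
        f"Text: {text}\n"
        f"Zájmeno: {pronoun}\n"
        f"Kandidáti:\n{numbered}\n"
        "Odpověz pouze číslem správného kandidáta."  # answer with a number only
    )

prompt = build_prompt(
    "Petr potkal Janu. Dal jí květiny.",  # "Petr met Jana. He gave her flowers."
    "jí",                                  # the anaphoric pronoun "her"
    ["Petr", "Jana", "květiny"],           # candidate antecedents
)
print(prompt)
```

A fine-tuned model would instead be trained to emit the antecedent directly from a similar serialized input, which is one reason the paper can compare both paradigms on the same candidate-selection accuracy metric.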