Towards Better Open-Ended Text Generation: A Multicriteria Evaluation Framework

📅 2024-10-24
🏛️ arXiv.org
📈 Citations: 0 (Influential: 0)
🤖 AI Summary
In open-ended text generation, widely used automatic metrics such as coherence, diversity, and perplexity exhibit inherent trade-offs: decoding methods that excel on one metric often underperform on another, which undermines any clear ranking of models and decoding strategies. This paper proposes ranking strategies within a multicriteria evaluation framework: it benchmarks decoding strategies via partial orderings, which formally capture when metric scores are incomparable, and introduces a new summary metric that balances existing automatic indicators for a more holistic assessment of generation quality. Experiments show that these methods offer a robust way to compare decoding strategies and serve as useful tools for model selection in open-ended text generation. The codebase, datasets, and models are publicly available.
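The partial-order idea can be made concrete with Pareto dominance: one decoding strategy precedes another only if it scores at least as well on every metric and strictly better on at least one; otherwise the pair is incomparable. The sketch below is a minimal illustration of that relation, not the paper's implementation; the metric set and all score values are assumptions.

```python
from typing import Dict

# Metrics where higher is better; perplexity is negated so that
# "larger is better" holds uniformly across all criteria.
METRICS = ("coherence", "diversity", "neg_perplexity")

def dominates(a: Dict[str, float], b: Dict[str, float]) -> bool:
    """True if strategy `a` Pareto-dominates strategy `b`:
    at least as good on every metric, strictly better on one."""
    at_least_as_good = all(a[m] >= b[m] for m in METRICS)
    strictly_better = any(a[m] > b[m] for m in METRICS)
    return at_least_as_good and strictly_better

# Hypothetical scores: nucleus sampling vs. greedy decoding.
nucleus = {"coherence": 0.71, "diversity": 0.83, "neg_perplexity": -14.2}
greedy  = {"coherence": 0.74, "diversity": 0.35, "neg_perplexity": -9.8}

# Neither dominates the other: the pair is incomparable, which is
# exactly the trade-off a forced total ordering would hide.
print(dominates(nucleus, greedy), dominates(greedy, nucleus))  # False False
```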

📝 Abstract
Open-ended text generation has become a prominent task in natural language processing due to the rise of powerful (large) language models. However, evaluating the quality of these models and the employed decoding strategies remains challenging because of trade-offs among widely used metrics such as coherence, diversity, and perplexity. Decoding methods often excel in some metrics while underperforming in others, complicating the establishment of a clear ranking. In this paper, we present novel ranking strategies within this multicriteria framework. Specifically, we employ benchmarking approaches based on partial orderings and present a new summary metric designed to balance existing automatic indicators, providing a more holistic evaluation of text generation quality. Our experiments demonstrate that the proposed methods offer a robust way to compare decoding strategies, and serve as valuable tools in guiding model selection for open-ended text generation tasks. Finally, we suggest future directions for improving evaluation methodologies in text generation. Our codebase, datasets, and models are publicly available.
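The paper's summary metric is not reproduced on this page. Purely to illustrate how several indicators can be folded into one balanced score, the following sketch uses a weighted geometric mean over metrics mapped to (0, 1]; the weights, the perplexity transform, and the example values are all hypothetical.

```python
import math

def summary_score(coherence: float, diversity: float, perplexity: float,
                  weights=(1.0, 1.0, 1.0), ppl_scale: float = 50.0) -> float:
    """Weighted geometric mean of three indicators mapped to (0, 1].
    Perplexity is inverted so that lower perplexity raises the score.
    All constants here are illustrative, not from the paper."""
    ppl_term = 1.0 / (1.0 + perplexity / ppl_scale)
    terms = (coherence, diversity, ppl_term)
    total_w = sum(weights)
    log_mean = sum(w * math.log(t) for w, t in zip(weights, terms)) / total_w
    return math.exp(log_mean)

# A strategy decent on all three criteria outscores a lopsided one.
print(round(summary_score(0.72, 0.80, 12.0), 3))  # ~0.774
print(round(summary_score(0.90, 0.20, 8.0), 3))   # ~0.537
```

A geometric mean penalizes lopsided profiles: a strategy weak on any one indicator cannot fully compensate with the others, which is one simple way to encode the kind of balance the abstract calls for.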
Problem

Research questions and friction points this paper is trying to address.

Challenges in evaluating open-ended text generation quality
Trade-offs among coherence, diversity, and perplexity metrics
Need for holistic evaluation frameworks for decoding strategies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multicriteria framework for text generation evaluation
Partial orderings for benchmarking decoding strategies (a ranking sketch follows this list)
New summary metric balancing automatic indicators
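Combining these ideas, a partial-order benchmark can report the set of maximal (undominated) strategies instead of forcing a single winner. The self-contained sketch below extends the dominance relation illustrated under the AI summary; the strategy names and scores are invented for illustration.

```python
from typing import Dict, List

METRICS = ("coherence", "diversity", "neg_perplexity")

def dominates(a: Dict[str, float], b: Dict[str, float]) -> bool:
    """Pareto dominance: `a` is at least as good everywhere, better somewhere."""
    return (all(a[m] >= b[m] for m in METRICS)
            and any(a[m] > b[m] for m in METRICS))

def pareto_front(strategies: Dict[str, Dict[str, float]]) -> List[str]:
    """Maximal elements of the partial order: strategies no one dominates."""
    return [name for name, scores in strategies.items()
            if not any(dominates(other, scores)
                       for o_name, other in strategies.items()
                       if o_name != name)]

# Hypothetical benchmark over four common decoding strategies.
candidates = {
    "greedy":  {"coherence": 0.74, "diversity": 0.35, "neg_perplexity": -9.8},
    "beam":    {"coherence": 0.70, "diversity": 0.30, "neg_perplexity": -10.5},
    "top_k":   {"coherence": 0.69, "diversity": 0.78, "neg_perplexity": -13.9},
    "nucleus": {"coherence": 0.71, "diversity": 0.83, "neg_perplexity": -14.2},
}

# `beam` is dominated by `greedy`; the other three are mutually
# incomparable, so the benchmark reports a set, not a single winner.
print(pareto_front(candidates))  # ['greedy', 'top_k', 'nucleus']
```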