Towards Better Open-Ended Text Generation: A Multicriteria Evaluation Framework

📅 2024-10-24
🏛️ arXiv.org
📈 Citations: 0 (Influential: 0)
🤖 AI Summary
In open-ended text generation, widely used automatic metrics such as coherence, diversity, and perplexity exhibit inherent trade-offs: decoding methods that excel on one metric often underperform on another, which undermines any clear ranking of models and decoding strategies. This paper proposes ranking strategies within a multicriteria evaluation framework: it benchmarks decoding strategies via partial orderings, which formally capture when metric scores are incomparable, and introduces a new summary metric that balances existing automatic indicators for a more holistic assessment of generation quality. Experiments show that these methods offer a robust way to compare decoding strategies and serve as useful tools for model selection in open-ended text generation. The codebase, datasets, and models are publicly available.
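The partial-order idea can be made concrete with Pareto dominance: one decoding strategy precedes another only if it scores at least as well on every metric and strictly better on at least one; otherwise the pair is incomparable. The sketch below is a minimal illustration of that relation, not the paper's implementation; the metric set and all score values are assumptions.

```python
from typing import Dict

# Metrics where higher is better; perplexity is negated so that
# "larger is better" holds uniformly across all criteria.
METRICS = ("coherence", "diversity", "neg_perplexity")

def dominates(a: Dict[str, float], b: Dict[str, float]) -> bool:
    """True if strategy `a` Pareto-dominates strategy `b`:
    at least as good on every metric, strictly better on one."""
    at_least_as_good = all(a[m] >= b[m] for m in METRICS)
    strictly_better = any(a[m] > b[m] for m in METRICS)
    return at_least_as_good and strictly_better

# Hypothetical scores: nucleus sampling vs. greedy decoding.
nucleus = {"coherence": 0.71, "diversity": 0.83, "neg_perplexity": -14.2}
greedy  = {"coherence": 0.74, "diversity": 0.35, "neg_perplexity": -9.8}

# Neither dominates the other: the pair is incomparable, which is
# exactly the trade-off a forced total ordering would hide.
print(dominates(nucleus, greedy), dominates(greedy, nucleus))  # False False
```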

📝 Abstract
Open-ended text generation has become a prominent task in natural language processing due to the rise of powerful (large) language models. However, evaluating the quality of these models and the employed decoding strategies remains challenging because of trade-offs among widely used metrics such as coherence, diversity, and perplexity. Decoding methods often excel in some metrics while underperforming in others, complicating the establishment of a clear ranking. In this paper, we present novel ranking strategies within this multicriteria framework. Specifically, we employ benchmarking approaches based on partial orderings and present a new summary metric designed to balance existing automatic indicators, providing a more holistic evaluation of text generation quality. Our experiments demonstrate that the proposed methods offer a robust way to compare decoding strategies, and serve as valuable tools in guiding model selection for open-ended text generation tasks. Finally, we suggest future directions for improving evaluation methodologies in text generation. Our codebase, datasets, and models are publicly available.
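The paper's summary metric is not reproduced on this page. Purely to illustrate how several indicators can be folded into one balanced score, the following sketch uses a weighted geometric mean over metrics mapped to (0, 1]; the weights, the perplexity transform, and the example values are all hypothetical.

```python
import math

def summary_score(coherence: float, diversity: float, perplexity: float,
                  weights=(1.0, 1.0, 1.0), ppl_scale: float = 50.0) -> float:
    """Weighted geometric mean of three indicators mapped to (0, 1].
    Perplexity is inverted so that lower perplexity raises the score.
    All constants here are illustrative, not from the paper."""
    ppl_term = 1.0 / (1.0 + perplexity / ppl_scale)
    terms = (coherence, diversity, ppl_term)
    total_w = sum(weights)
    log_mean = sum(w * math.log(t) for w, t in zip(weights, terms)) / total_w
    return math.exp(log_mean)

# A strategy decent on all three criteria outscores a lopsided one.
print(round(summary_score(0.72, 0.80, 12.0), 3))  # ~0.774
print(round(summary_score(0.90, 0.20, 8.0), 3))   # ~0.537
```

A geometric mean penalizes lopsided profiles: a strategy weak on any one indicator cannot fully compensate with the others, which is one simple way to encode the kind of balance the abstract calls for.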
Problem

Research questions and friction points this paper is trying to address.

Challenges in evaluating open-ended text generation quality
Trade-offs among coherence, diversity, and perplexity metrics
Need for holistic evaluation frameworks for decoding strategies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multicriteria framework for text generation evaluation
Partial orderings for benchmarking decoding strategies (a ranking sketch follows this list)
New summary metric balancing automatic indicators
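Combining these ideas, a partial-order benchmark can report the set of maximal (undominated) strategies instead of forcing a single winner. The self-contained sketch below extends the dominance relation illustrated under the AI summary; the strategy names and scores are invented for illustration.

```python
from typing import Dict, List

METRICS = ("coherence", "diversity", "neg_perplexity")

def dominates(a: Dict[str, float], b: Dict[str, float]) -> bool:
    """Pareto dominance: `a` is at least as good everywhere, better somewhere."""
    return (all(a[m] >= b[m] for m in METRICS)
            and any(a[m] > b[m] for m in METRICS))

def pareto_front(strategies: Dict[str, Dict[str, float]]) -> List[str]:
    """Maximal elements of the partial order: strategies no one dominates."""
    return [name for name, scores in strategies.items()
            if not any(dominates(other, scores)
                       for o_name, other in strategies.items()
                       if o_name != name)]

# Hypothetical benchmark over four common decoding strategies.
candidates = {
    "greedy":  {"coherence": 0.74, "diversity": 0.35, "neg_perplexity": -9.8},
    "beam":    {"coherence": 0.70, "diversity": 0.30, "neg_perplexity": -10.5},
    "top_k":   {"coherence": 0.69, "diversity": 0.78, "neg_perplexity": -13.9},
    "nucleus": {"coherence": 0.71, "diversity": 0.83, "neg_perplexity": -14.2},
}

# `beam` is dominated by `greedy`; the other three are mutually
# incomparable, so the benchmark reports a set, not a single winner.
print(pareto_front(candidates))  # ['greedy', 'top_k', 'nucleus']
```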