Beyond One-Size-Fits-All: Inversion Learning for Highly Effective NLG Evaluation Prompts

📅 2025-04-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
NLG evaluation faces dual challenges: poor reproducibility of human evaluation and prompt sensitivity of LLM-based automated assessment. To address these, we propose a reverse-learning framework that automatically constructs model-specific evaluation prompts by inverting from model-generated outputs to optimal instructions. Our approach introduces the first single-shot “output→instruction” inverse mapping mechanism, eliminating manual prompt tuning and substantially enhancing prompt robustness and task adaptability. Evaluated across multiple NLG benchmarks, our method outperforms both handcrafted and state-of-the-art automated prompting methods—achieving +12.3% improvement in evaluation consistency (Kendall’s τ) and +9.7% in correlation with human judgments (Pearson). Moreover, it accelerates evaluation throughput by over 10×. This work establishes a new paradigm for LLM-based NLG evaluation that is reproducible, personalized, and low-resource-dependent.

Technology Category

Application Category

📝 Abstract
Evaluating natural language generation (NLG) systems is challenging due to the diversity of valid outputs. While human evaluation is the gold standard, it suffers from inconsistencies, lack of standardisation, and demographic biases, limiting reproducibility. LLM-based evaluation offers a scalable alternative but is highly sensitive to prompt design, where small variations can lead to significant discrepancies. In this work, we propose an inversion learning method that learns effective reverse mappings from model outputs back to their input instructions, enabling the automatic generation of highly effective, model-specific evaluation prompts. Our method requires only a single evaluation sample and eliminates the need for time-consuming manual prompt engineering, thereby improving both efficiency and robustness. Our work contributes toward a new direction for more robust and efficient LLM-based evaluation.
Problem

Research questions and friction points this paper is trying to address.

Diverse valid outputs challenge NLG system evaluation
Human evaluation lacks consistency and reproducibility
LLM-based evaluation is sensitive to prompt design
Innovation

Methods, ideas, or system contributions that make the work stand out.

Inversion learning for reverse mappings
Automatic generation of model-specific prompts
Single evaluation sample eliminates manual engineering
🔎 Similar Papers
No similar papers found.