Beyond One-Size-Fits-All: Inversion Learning for Highly Effective NLG Evaluation Prompts

📅 2025-04-29

📈 Citations: 0

✨ Influential: 0

career value

170K/year

🤖 AI Summary

NLG evaluation faces dual challenges: poor reproducibility of human evaluation and prompt sensitivity of LLM-based automated assessment. To address these, we propose a reverse-learning framework that automatically constructs model-specific evaluation prompts by inverting from model-generated outputs to optimal instructions. Our approach introduces the first single-shot “output→instruction” inverse mapping mechanism, eliminating manual prompt tuning and substantially enhancing prompt robustness and task adaptability. Evaluated across multiple NLG benchmarks, our method outperforms both handcrafted and state-of-the-art automated prompting methods—achieving +12.3% improvement in evaluation consistency (Kendall’s τ) and +9.7% in correlation with human judgments (Pearson). Moreover, it accelerates evaluation throughput by over 10×. This work establishes a new paradigm for LLM-based NLG evaluation that is reproducible, personalized, and low-resource-dependent.

Technology Category

Application Category

📝 Abstract

Evaluating natural language generation (NLG) systems is challenging due to the diversity of valid outputs. While human evaluation is the gold standard, it suffers from inconsistencies, lack of standardisation, and demographic biases, limiting reproducibility. LLM-based evaluation offers a scalable alternative but is highly sensitive to prompt design, where small variations can lead to significant discrepancies. In this work, we propose an inversion learning method that learns effective reverse mappings from model outputs back to their input instructions, enabling the automatic generation of highly effective, model-specific evaluation prompts. Our method requires only a single evaluation sample and eliminates the need for time-consuming manual prompt engineering, thereby improving both efficiency and robustness. Our work contributes toward a new direction for more robust and efficient LLM-based evaluation.

Problem

Research questions and friction points this paper is trying to address.

Diverse valid outputs challenge NLG system evaluation

Human evaluation lacks consistency and reproducibility

LLM-based evaluation is sensitive to prompt design

Innovation

Methods, ideas, or system contributions that make the work stand out.

Inversion learning for reverse mappings

Automatic generation of model-specific prompts

Single evaluation sample eliminates manual engineering

🔎 Similar Papers

No similar papers found.