🤖 AI Summary
NLG evaluation faces dual challenges: poor reproducibility of human evaluation and prompt sensitivity of LLM-based automated assessment. To address these, we propose a reverse-learning framework that automatically constructs model-specific evaluation prompts by inverting from model-generated outputs to optimal instructions. Our approach introduces the first single-shot “output→instruction” inverse mapping mechanism, eliminating manual prompt tuning and substantially enhancing prompt robustness and task adaptability. Evaluated across multiple NLG benchmarks, our method outperforms both handcrafted and state-of-the-art automated prompting methods—achieving +12.3% improvement in evaluation consistency (Kendall’s τ) and +9.7% in correlation with human judgments (Pearson). Moreover, it accelerates evaluation throughput by over 10×. This work establishes a new paradigm for LLM-based NLG evaluation that is reproducible, personalized, and low-resource-dependent.
📝 Abstract
Evaluating natural language generation (NLG) systems is challenging due to the diversity of valid outputs. While human evaluation is the gold standard, it suffers from inconsistencies, lack of standardisation, and demographic biases, limiting reproducibility. LLM-based evaluation offers a scalable alternative but is highly sensitive to prompt design, where small variations can lead to significant discrepancies. In this work, we propose an inversion learning method that learns effective reverse mappings from model outputs back to their input instructions, enabling the automatic generation of highly effective, model-specific evaluation prompts. Our method requires only a single evaluation sample and eliminates the need for time-consuming manual prompt engineering, thereby improving both efficiency and robustness. Our work contributes toward a new direction for more robust and efficient LLM-based evaluation.