🤖 AI Summary
This paper investigates the effectiveness of checklists in automating evaluation for generative tasks, particularly under ambiguous or ill-defined evaluation criteria. We systematically compare six checklist generation methods across eight model scales, conducting both pairwise comparison and direct scoring experiments. Results show that selective (rather than universal) application of checklists significantly improves pairwise comparison performance, whereas their impact on direct scoring is inconsistent. Notably, even checklist items with low correlation to human scores capture trends observed in human annotations, revealing latent inconsistencies in human evaluation. Our core contribution is identifying the applicability boundaries of checklists, proposing a "paradigm-differentiated activation" principle (enabling checklists only when they align with the evaluation paradigm), and emphasizing the need to explicitly define objective assessment criteria so that human and automated evaluations remain consistent.
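To make the "paradigm-differentiated activation" idea concrete, here is a minimal sketch, not taken from the paper's codebase, of an LLM-judge prompt builder that attaches checklist items only in the pairwise-comparison paradigm. The function, enum, and prompt wording are all hypothetical illustrations of the principle.

```python
# Minimal sketch of "paradigm-differentiated activation": checklist items are
# injected into the judge prompt only for the pairwise-comparison paradigm,
# where selective use is reported to help. All names here are hypothetical.
from enum import Enum

class Paradigm(Enum):
    PAIRWISE = "pairwise"       # compare two responses (A vs. B)
    DIRECT = "direct_scoring"   # score a single response on its own

def build_judge_prompt(question: str, responses: list[str],
                       checklist: list[str], paradigm: Paradigm) -> str:
    """Assemble an LLM-judge prompt, enabling the checklist selectively."""
    prompt = f"Question:\n{question}\n\n"
    for i, response in enumerate(responses):
        prompt += f"Response {chr(65 + i)}:\n{response}\n\n"
    # Selective activation: only the pairwise paradigm gets the checklist.
    if paradigm is Paradigm.PAIRWISE and checklist:
        items = "\n".join(f"- {item}" for item in checklist)
        prompt += f"Evaluate each response against these criteria:\n{items}\n\n"
    if paradigm is Paradigm.PAIRWISE:
        prompt += "Answer with the letter of the better response: A or B."
    else:
        prompt += "Score the response from 1 to 10."
    return prompt
```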
📝 Abstract
Automatic evaluation of generative tasks using large language models faces challenges due to ambiguous criteria. Although automatic checklist generation is a potentially promising approach, its usefulness remains underexplored. We investigate whether checklists should be used for all questions or selectively, generate them using six methods, evaluate their effectiveness across eight model sizes, and identify checklist items that correlate with human evaluations. Through experiments on pairwise comparison and direct scoring tasks, we find that selective checklist use tends to improve evaluation performance in pairwise settings, while its benefits are less consistent in direct scoring. Our analysis also shows that even checklist items with low correlation to human scores often reflect human-written criteria, indicating potential inconsistencies in human evaluation. These findings highlight the need to more clearly define objective evaluation criteria to guide both human and automatic evaluations. Our code is available at https://github.com/momo0817/checklist-effectiveness-study
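As a rough illustration of the item-level correlation analysis the abstract describes, the sketch below rank-correlates each checklist item's pass/fail verdicts with human scores to flag items that track, or diverge from, human judgments. The data, item names, and use of Spearman correlation are assumptions for illustration, not details confirmed by the paper.

```python
# Sketch of an item-level analysis: correlate each checklist item's binary
# verdicts with human scores. The numbers and item names are illustrative.
from scipy.stats import spearmanr

# One entry per evaluated response: a human score plus per-item verdicts.
human_scores = [7, 4, 9, 5, 8, 3]
item_verdicts = {
    "addresses_all_subquestions": [1, 0, 1, 0, 1, 0],
    "cites_supporting_evidence":  [1, 1, 1, 0, 0, 0],
}

# A low rho does not necessarily mean a bad item: per the paper's finding,
# such items may still reflect human-written criteria, hinting at
# inconsistencies in the human annotations themselves.
for item, verdicts in item_verdicts.items():
    rho, p_value = spearmanr(verdicts, human_scores)
    print(f"{item}: Spearman rho={rho:.2f} (p={p_value:.3f})")
```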