🤖 AI Summary
This study addresses a gap in the evaluation of scientific news generation: existing metrics emphasize semantic similarity and factual consistency but do not measure how much readers actually learn. To bridge this gap, we propose KnowledgeGain, an evaluation metric that quantifies knowledge gain with a pretest-posttest design measuring readers’ learning outcomes. A first human study shows that the metric captures differences in knowledge gained across types of science media. To identify high-quality articles efficiently, we develop a prompt-only large language model reader simulator, calibrated against the human study data, and use it to rank and filter candidate articles. A second human study shows that articles selected with the simulator improve readers’ post-reading question-answering accuracy and normalized knowledge gain over a strong generation baseline.
📝 Abstract
Science news is an important medium for communicating discoveries from research communities to the public. Yet most metrics for generated or summarized text evaluate semantic similarity and factual consistency and do not measure how much knowledge readers learn from the news. We introduce KnowledgeGain, a metric that evaluates the quality of science news by measuring how much knowledge readers gain from reading it. To evaluate the metric, we first performed a controlled human study and showed that the metric captures the differential knowledge gained by human readers reading different types of science media. The resulting data allowed us to calibrate a prompt-only LLM reader simulator, which we use to rank and filter candidate articles before human evaluation. A second human study shows that articles selected with this simulator improve post-reading accuracy and normalized KnowledgeGain over a strong generation baseline. Our work is a step toward generating science news that better meets the knowledge and comprehension goals of Bloom's Taxonomy.
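The abstract does not state the formula behind normalized KnowledgeGain. As an illustration only, the sketch below assumes the widely used pretest-posttest normalization (often attributed to Hake), in which the raw improvement is divided by the maximum improvement still available after the pretest. The function names `accuracy` and `normalized_gain`, and the choice to return `None` when the pretest is already perfect, are hypothetical and not taken from the paper.

```python
from typing import Optional, Sequence


def accuracy(answers: Sequence[str], key: Sequence[str]) -> float:
    """Fraction of quiz questions answered correctly."""
    if not key or len(answers) != len(key):
        raise ValueError("answers and key must be non-empty and the same length")
    return sum(a == k for a, k in zip(answers, key)) / len(key)


def normalized_gain(pre: float, post: float) -> Optional[float]:
    """Assumed normalization: (post - pre) / (1 - pre).

    Both scores are accuracies in [0, 1]. Returns None when the pretest
    is already perfect, because no further gain is possible to measure.
    """
    if not (0.0 <= pre <= 1.0 and 0.0 <= post <= 1.0):
        raise ValueError("scores must lie in [0, 1]")
    if pre == 1.0:
        return None
    return (post - pre) / (1.0 - pre)


# Hypothetical reader: same four questions before and after reading.
key = ["b", "a", "d", "c"]
pre = accuracy(["b", "a", "a", "a"], key)   # 2 of 4 correct
post = accuracy(["b", "a", "d", "a"], key)  # 3 of 4 correct
gain = normalized_gain(pre, post)           # half of the remaining headroom
```

Normalizing by the remaining headroom keeps readers with different prior knowledge comparable: a reader who moves from 50% to 75% and one who moves from 0% to 50% both close half of their gap. A negative value indicates the reader did worse after reading.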