🤖 AI Summary
This work addresses the reference-free evaluation challenge in multimodal presentation generation. Methodologically, it introduces (1) the first reference-free evaluation paradigm based on negative-sample perturbation, generating discriminative signals via controlled semantic, structural, and visual perturbations; (2) RefSlides, the first high-quality, human-annotated benchmark for presentation evaluation; and (3) a decoupled multi-metric modeling framework combined with LLM instruction tuning that enables fine-grained scoring of intrinsic attributes, such as summarization capability and concept conveyance, and produces interpretable natural-language feedback. Experiments demonstrate that the approach significantly outperforms heuristic baselines and state-of-the-art LLM-based evaluators in both automated and human evaluations: scoring consistency improves by 21.3%, and feedback usefulness increases by 34.7%.
📝 Abstract
Automatically generating presentation slides is an important problem in the era of generative AI. This paper focuses on evaluating multimodal content in presentation slides that can effectively summarize a document and convey concepts to a broad audience. We introduce a benchmark dataset, RefSlides, consisting of high-quality, human-made presentations that span various topics. Next, we propose a set of metrics to characterize different intrinsic properties of presentation content and present REFLEX, an evaluation approach that generates scores and actionable feedback for these metrics. We achieve this by generating negative presentation samples with varying degrees of metric-specific perturbation and using them to fine-tune LLMs. This reference-free evaluation technique does not require ground-truth presentations during inference. Our extensive automated and human experiments demonstrate that our evaluation approach outperforms classical heuristic-based and state-of-the-art large language model-based evaluations in generating scores and explanations.
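The core idea described above is to take high-quality reference decks, apply metric-specific perturbations of controlled severity to produce negative samples, and pair each perturbed deck with a target score so an LLM can be instruction-tuned to evaluate unseen presentations without a reference. The sketch below illustrates that idea under stated assumptions: the `Slide` structure, the two perturbation functions, and the severity-to-score mapping are hypothetical placeholders, not the paper's actual implementation.

```python
import random
from dataclasses import dataclass


@dataclass
class Slide:
    title: str
    bullets: list[str]


def perturb_coverage(slides: list[Slide], severity: float) -> list[Slide]:
    """Drop a fraction of bullet points so the deck summarizes the source
    document less completely (higher severity -> worse negative sample)."""
    return [
        Slide(title=s.title,
              bullets=[b for b in s.bullets if random.random() > severity])
        for s in slides
    ]


def perturb_structure(slides: list[Slide], severity: float) -> list[Slide]:
    """Swap randomly chosen slide positions to degrade logical flow."""
    slides = list(slides)
    for _ in range(int(severity * len(slides))):
        i, j = random.sample(range(len(slides)), 2)
        slides[i], slides[j] = slides[j], slides[i]
    return slides


def build_training_pairs(reference_deck: list[Slide]):
    """Pair each perturbation level with a target score per metric, yielding
    (deck, metric, score) tuples for LLM instruction tuning. The severity
    levels and score scale here are illustrative assumptions."""
    pairs = []
    for severity, score in [(0.0, 5), (0.3, 3), (0.6, 1)]:
        pairs.append((perturb_coverage(reference_deck, severity), "coverage", score))
        pairs.append((perturb_structure(reference_deck, severity), "structure", score))
    return pairs
```

At inference time, the tuned evaluator would see only the source document and the candidate deck, so no ground-truth presentation is needed to produce per-metric scores and feedback.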