AI Summary
Factuality assessment in narrative understanding suffers from subjectivity, particularly when judging the fidelity of statements to a source document amid ambiguous boundaries. Method: This paper reframes binary faithfulness classification as quantifiable ambiguity measurement, introducing the Ambiguity Rewrite Metric (ARM). ARM leverages large language models to generate controlled summary edits, quantifying a claim's ambiguity by the magnitude of rewriting it requires rather than by a binary label. The approach integrates controlled summary editing, narrative consistency modeling, and an evaluation framework for quantifying subjective judgments. Results: On narrative summarization tasks, ARM improves inter-annotator agreement by 21 percentage points, substantially mitigating the unreliability in factuality evaluation caused by divergent subjective interpretations. It establishes the first generative rewriting-based paradigm for quantifying ambiguity in factuality assessment.
Abstract
Determining the faithfulness of a claim to a source document is an important problem across many domains. This task is generally treated as a binary judgment of whether the claim is supported or unsupported by the source. In many cases, though, whether a claim is supported can be ambiguous. For instance, it may depend on making inferences from the given evidence, and different people can reasonably interpret the claim as either supported or unsupported depending on whether they accept those inferences. Forcing binary labels onto such claims lowers the reliability of evaluation. In this work, we reframe the task to manage the subjectivity involved in factuality judgments of ambiguous claims. We introduce LLM-generated edits of summaries as a method of providing a nuanced evaluation of claims: how much does a summary need to be edited to be unambiguous? Whether a claim gets rewritten, and how much it changes, can serve as an automatic evaluation metric, the Ambiguity Rewrite Metric (ARM), with a much richer feedback signal than a binary faithfulness judgment. We focus on narrative summarization, as it is particularly rife with ambiguity and subjective interpretation. We show that ARM produces a 21% absolute improvement in annotator agreement on claim faithfulness, indicating that subjectivity is reduced.
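The core idea, scoring a claim by how heavily it must be rewritten to become unambiguous, can be sketched in miniature. The snippet below is a hypothetical illustration, not the paper's actual metric: it stands in for the LLM rewriting step with precomputed strings and measures rewrite magnitude as a token-level edit distance, where 0 means the claim needed no editing. The function name `arm_score` and the example claims are assumptions for demonstration.

```python
# Hedged sketch of an ARM-style rewrite-magnitude score. The real ARM uses
# LLM-generated edits; here we only illustrate the scoring side, using
# token-level similarity from the standard library.
from difflib import SequenceMatcher

def arm_score(original: str, rewritten: str) -> float:
    """Rewrite magnitude in [0, 1]; 0.0 means the claim was left unchanged."""
    a, b = original.split(), rewritten.split()
    # SequenceMatcher.ratio() is similarity in [0, 1]; 1 - ratio is edit size.
    return 1.0 - SequenceMatcher(None, a, b).ratio()

# An unambiguous claim needs no rewrite and scores 0.0; a claim that an
# LLM hedges or qualifies scores higher, giving a graded signal instead
# of a binary supported/unsupported label.
claim = "The hero forgave his brother before the battle."
rewritten = ("The hero spoke with his brother before the battle, "
             "though the text leaves forgiveness implicit.")

print(arm_score(claim, claim))      # 0.0
print(arm_score(claim, rewritten))  # strictly greater than 0
```

In this framing, the graded score is what enables agreement gains: annotators who would split on a forced binary label can instead agree that a claim required a moderate rewrite.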