Multi-group Uncertainty Quantification for Long-form Text Generation

📅 2024-07-25
🏛️ arXiv.org
📈 Citations: 7
Influential: 0
🤖 AI Summary
This study addresses the failure of standard uncertainty quantification on subpopulations (e.g., texts associated with distinct demographic attributes) in long-form text generation. It proposes a dual-uncertainty framework that jointly provides calibration for individual statements and conformal prediction over the entire output. By bringing multicalibration and multivalid conformal prediction, two rigorous statistical notions, to long-form generation, the method achieves statistical guarantees that hold both marginally and at the subgroup level. The authors also construct an empirical benchmark for uncertainty quantification in long-form text generation. Using biography generation as a testbed, the method integrates probabilistic calibration with conformal prediction techniques. Experiments demonstrate clear improvements in both overall and subgroup-level uncertainty calibration, and incorporating prompt-group attributes further improves both aggregate and subgroup-wise trustworthiness assessment.

📝 Abstract
While large language models are rapidly moving towards consumer-facing applications, they are often still prone to factual errors and hallucinations. In order to reduce the potential harms that may come from these errors, it is important for users to know to what extent they can trust an LLM when it makes a factual claim. To this end, we study the problem of uncertainty quantification of factual correctness in long-form natural language generation. Given some output from a large language model, we study both uncertainty at the level of individual claims contained within the output (via calibration) and uncertainty across the entire output itself (via conformal prediction). Moreover, we invoke multicalibration and multivalid conformal prediction to ensure that such uncertainty guarantees are valid both marginally and across distinct groups of prompts. Using the task of biography generation, we demonstrate empirically that having access to and making use of additional group attributes for each prompt improves both overall and group-wise performance. As the problems of calibration, conformal prediction, and their multi-group counterparts have not been extensively explored previously in the context of long-form text generation, we consider these empirical results to form a benchmark for this setting.
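To make the output-level technique concrete, here is a minimal sketch of split conformal prediction, the standard construction behind the marginal coverage guarantee the abstract describes. This is an illustration, not the paper's exact procedure: the calibration scores and the claim-filtering use are hypothetical stand-ins for claim-level nonconformity scores computed on held-out biography generations.

```python
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    """Split conformal: pick a threshold on held-out nonconformity scores
    so that a fresh score falls below it with probability >= 1 - alpha."""
    n = len(cal_scores)
    # Finite-sample-corrected quantile level for marginal coverage.
    q = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(cal_scores, q, method="higher")

# Hypothetical calibration scores; in the paper's setting these would be
# claim-level nonconformity scores (e.g., 1 - confidence) on a held-out set.
rng = np.random.default_rng(0)
cal = rng.uniform(size=1000)
tau = conformal_threshold(cal, alpha=0.1)
# At generation time, keep only claims whose nonconformity score is <= tau.
```

The guarantee is marginal: averaged over prompts, at least a 1 - alpha fraction of fresh scores land below `tau`; the multivalid variants discussed below strengthen this to hold per prompt group.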
Problem

Research questions and friction points this paper is trying to address.

Quantify uncertainty in LLM outputs for subgroups
Evaluate claim-level and output-wide uncertainty in text
Improve subgroup calibration using demographic attributes
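For the claim-level side listed above, a common baseline calibrator is histogram binning: map each raw claim confidence to the empirical correctness rate of its bin on held-out data. This sketch uses synthetic overconfident scores and is a generic baseline, not the paper's specific calibration method.

```python
import numpy as np

def histogram_binning(confidences, labels, n_bins=10):
    """Histogram-binning calibrator: replace each raw confidence with the
    empirical correctness rate of its bin (a standard baseline)."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.clip(np.digitize(confidences, edges) - 1, 0, n_bins - 1)
    rates = np.full(n_bins, 0.5)  # fallback for empty bins
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            rates[b] = labels[mask].mean()

    def calibrate(p):
        b = np.clip(np.digitize(p, edges) - 1, 0, n_bins - 1)
        return rates[b]

    return calibrate

# Hypothetical data: claims with confidence p are actually correct with
# probability 0.7 * p, i.e., the raw scores are overconfident.
rng = np.random.default_rng(2)
raw = rng.uniform(size=5000)
correct = (rng.uniform(size=5000) < 0.7 * raw).astype(float)
cal = histogram_binning(raw, correct)
```

After fitting, `cal` returns deflated probabilities that track the true correctness rate per bin, which is exactly the marginal property that multicalibration then extends to every prompt subgroup.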
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses calibration for individual claim uncertainty
Applies conformal prediction for overall output uncertainty
Employs multicalibration for subgroup uncertainty quantification
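The group-conditional idea behind the last bullet can be illustrated with a simple variant: instead of one marginal conformal threshold, compute a quantile per prompt group so coverage holds within each group separately. This is a didactic sketch in the spirit of multivalid conformal prediction, not the paper's algorithm, and the group labels and score distributions are hypothetical.

```python
import numpy as np

def groupwise_thresholds(cal_scores, cal_groups, alpha=0.1):
    """Per-group split-conformal thresholds: a simple group-conditional
    variant (illustrative only, not the paper's exact method)."""
    thresholds = {}
    for g in set(cal_groups):
        s = cal_scores[cal_groups == g]
        n = len(s)
        q = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
        thresholds[g] = np.quantile(s, q, method="higher")
    return thresholds

rng = np.random.default_rng(1)
# Hypothetical groups: group "A" has systematically higher nonconformity
# scores, so a single marginal threshold would under-cover it.
groups = np.array(["A"] * 500 + ["B"] * 500)
scores = np.where(groups == "A",
                  rng.uniform(0.2, 1.0, size=1000),
                  rng.uniform(0.0, 0.8, size=1000))
taus = groupwise_thresholds(scores, groups, alpha=0.1)
```

Because group "A" is harder, its threshold comes out higher; a single marginal threshold would sit between the two and fail to give group "A" its 1 - alpha coverage, which is the failure mode the multivalid guarantees are designed to prevent.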