🤖 AI Summary
This work addresses the challenge of evaluating conditional generation quality in compositional extrapolation settings, where the true target distribution is unavailable. The authors propose a post-hoc, per-sample trust score that requires no access to the target distribution and relies only on the training distribution. The score combines two estimable quantities: global realism, which measures compatibility with the real data manifold, and attribute-wise faithfulness, which measures whether a sample is closer to the requested attributes than to plausible alternatives. It requires no additional training and applies directly to off-the-shelf pretrained generative models. To the best of the authors' knowledge, this is the first approach enabling effective evaluation of compositional extrapolation samples, supporting sample filtering, ranking, and abstention. Experiments on biological imaging and controlled vision benchmarks show improved morphological fidelity and downstream predictive performance, and the score can also be applied during generation to abstain before full decoding.
📝 Abstract
Conditional generators provide a natural tool for controllable generation, including settings where the desired condition is a new composition of observed attributes or experimental factors. In many applications, especially in scientific domains, such models are attractive for exploring conditions for which real samples are rare, expensive, or not yet observed. However, this creates a circularity for evaluation: standard conditional quality metrics require a reference target distribution, but in the extrapolative regime that distribution is unavailable by definition. We address this problem with a post-hoc, per-sample trust score for assessing conditional samples using only the training distribution. The score combines two estimable quantities: global realism, measuring compatibility with the real data manifold, and attribute-wise faithfulness, measuring whether a sample is closer to the requested attributes than to plausible alternatives. We show that, under a mild coverage condition on the observed attributes, the score can recover meaningful comparisons across extrapolated generations. These comparisons enable effective filtering, ranking, and abstention of generations and can be used directly on off-the-shelf pretrained models. In biological imaging, selected samples preserve real morphological structure better and improve downstream predictive performance, while similar gains are observed on controlled vision benchmarks. Finally, we show how the score can be applied during generation, enabling abstention before full decoding. Code is available at https://github.com/berkerdemirel/faithful-cond-gen.
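To make the two ingredients of the score concrete, here is a minimal sketch of how a score of this kind could be assembled. It is not the authors' implementation (the linked repository is the reference for that), and every modelling choice in it is an assumption made for illustration: samples are represented by embedding vectors, realism is approximated by the negative mean distance to the k nearest training embeddings, faithfulness is a centroid-based contrastive margin aggregated by its worst attribute, and the two terms are mixed with a hypothetical `realism_weight`. The function names `realism`, `faithfulness`, and `trust_score` are likewise hypothetical.

```python
import numpy as np

def realism(z, train_z, k=5):
    """Assumed proxy: negative mean distance to the k nearest training embeddings."""
    dists = np.linalg.norm(train_z - z, axis=1)
    return -float(np.sort(dists)[:k].mean())

def faithfulness(z, train_z, train_attrs, requested):
    """Assumed proxy: worst-case contrastive margin over attributes.

    Per attribute, margin = distance to the nearest alternative-value centroid
    minus distance to the requested-value centroid (positive is better).
    """
    margins = []
    for j, want in enumerate(requested):
        column = train_attrs[:, j]
        # Each requested value must occur somewhere in training, even if the
        # full combination does not.
        if not np.any(column == want):
            raise ValueError(f"attribute {j} value {want} never observed in training")
        d_req = np.linalg.norm(z - train_z[column == want].mean(axis=0))
        d_alt = min(
            np.linalg.norm(z - train_z[column == v].mean(axis=0))
            for v in np.unique(column) if v != want
        )
        margins.append(d_alt - d_req)
    return float(min(margins))

def trust_score(z, train_z, train_attrs, requested, realism_weight=0.1, k=5):
    """Hypothetical combination: weighted realism plus worst-attribute margin."""
    return realism_weight * realism(z, train_z, k) + faithfulness(
        z, train_z, train_attrs, requested
    )

# Toy data: attribute 0 shifts the embedding along x, attribute 1 along y.
# The combination (1, 1) is held out of training.
rng = np.random.default_rng(0)
combos = [(0, 0), (0, 1), (1, 0)]
train_attrs = np.repeat(np.array(combos), 20, axis=0)
train_z = 4.0 * train_attrs + rng.normal(scale=0.1, size=train_attrs.shape)

requested = (1, 1)
candidates = {
    "faithful": np.array([4.0, 4.0]),
    "wrong_attribute": np.array([0.0, 4.0]),
    "off_manifold": np.array([40.0, 40.0]),
}
scores = {
    name: trust_score(z, train_z, train_attrs, requested)
    for name, z in candidates.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy setup the candidate that matches both requested attributes ranks first, ahead of one that ignores attribute 0 and one that lies far from all training data. The result depends on the sketch's own choices: a nearest-neighbour realism term penalizes any held-out combination, so the small `realism_weight` and the worst-attribute aggregation are what keep the faithful candidate on top here. Ranking by such a score and discarding the tail is one simple way to realize the filtering and abstention uses the abstract describes.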