🤖 AI Summary
Existing image similarity metrics such as LPIPS and CLIP often fail to align with human subjective judgments in text-to-image generation tasks, particularly in personalized or context-sensitive scenarios. This work proposes CLPIPS, which leverages user-provided ranking feedback on generated images to fine-tune the layer combination weights of LPIPS through a lightweight adaptation. By optimizing these weights with a margin-based ranking loss on human-annotated data, CLPIPS aligns perceptual similarity scores with individual users' preferences. Agreement with human judgments is evaluated using Spearman's rank correlation coefficient and the intraclass correlation coefficient. Experimental results show that CLPIPS significantly outperforms the original LPIPS in capturing user preferences, validating the efficacy of lightweight, personalized fine-tuning for perceptual similarity assessment.
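As a minimal illustration of the Spearman evaluation mentioned above, the sketch below computes the rank correlation between a metric's distances and human similarity ranks using SciPy. The distances and ranks are made-up demonstration values, not data from the paper:

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical example: metric distances and human ranks for five
# generated images of one target (rank 1 = judged most similar).
# These numbers are illustrative, not taken from the study.
metric_dist = np.array([0.12, 0.45, 0.31, 0.60, 0.22])
human_rank = np.array([1, 4, 3, 5, 2])

# Spearman's rho compares the two orderings, not the raw values.
rho, p_value = spearmanr(metric_dist, human_rank)
# rho = 1.0 here: the metric orders the images exactly as the human did.
```

A perfectly aligned metric yields rho = 1.0 regardless of the absolute distance values, which is why the paper emphasizes ranking consistency rather than absolute metric scores.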
📄 Abstract
Iterative prompt refinement is central to reproducing target images with text-to-image generative models. Previous studies have incorporated image similarity metrics (ISMs) as additional feedback for human users. Existing ISMs such as LPIPS and CLIP provide objective measures of image likeness but often fail to align with human judgments, particularly in context-specific or user-driven tasks. In this paper, we introduce Customized Learned Perceptual Image Patch Similarity (CLPIPS), a customized extension of LPIPS that adapts a metric's notion of similarity directly to human judgments. We explore whether lightweight, human-augmented fine-tuning can meaningfully improve perceptual alignment, positioning similarity metrics as adaptive components for human-in-the-loop workflows with text-to-image tools. We evaluate CLPIPS on a human-subject dataset in which participants iteratively regenerate target images and rank the generated outputs by perceived similarity. Using a margin ranking loss on human-ranked image pairs, we fine-tune only the LPIPS layer combination weights and assess alignment via Spearman rank correlation and the intraclass correlation coefficient. Our results show that CLPIPS achieves stronger correlation and agreement with human judgments than baseline LPIPS. Rather than optimizing absolute metric performance, our work emphasizes consistency between metric predictions and human ranks, demonstrating that even limited, human-specific fine-tuning can meaningfully enhance perceptual alignment in human-in-the-loop text-to-image workflows.
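The core idea, fine-tuning only the layer combination weights with a margin ranking loss on human-ranked pairs, can be sketched in PyTorch. This is a toy stand-in, not the authors' implementation: the per-layer feature distances are synthetic random values standing in for distances produced by a frozen LPIPS backbone, and the separation between "preferred" and "dispreferred" pairs is constructed for demonstration:

```python
import torch
import torch.nn as nn

class WeightedLayerSimilarity(nn.Module):
    """Toy stand-in for the LPIPS layer combination: one learnable
    nonnegative weight per feature layer, applied to precomputed
    per-layer distances (the backbone itself stays frozen, as in LPIPS)."""
    def __init__(self, n_layers: int = 5):
        super().__init__()
        self.raw_w = nn.Parameter(torch.zeros(n_layers))

    def forward(self, layer_dists: torch.Tensor) -> torch.Tensor:
        # softplus keeps the combination weights nonnegative
        w = torch.nn.functional.softplus(self.raw_w)
        return layer_dists @ w  # one scalar distance per image pair

torch.manual_seed(0)
model = WeightedLayerSimilarity()
opt = torch.optim.Adam(model.parameters(), lr=0.05)
rank_loss = nn.MarginRankingLoss(margin=0.1)

# Synthetic human feedback: in each pair, image A was ranked closer to
# the target than image B, so the metric should assign d(A) < d(B).
d_a = torch.rand(64, 5) * 0.5          # per-layer distances, preferred images
d_b = torch.rand(64, 5) * 0.5 + 0.3    # per-layer distances, dispreferred images

for _ in range(200):
    opt.zero_grad()
    s_a, s_b = model(d_a), model(d_b)
    # target = -1 encodes "the first input should be the smaller one"
    loss = rank_loss(s_a, s_b, target=-torch.ones_like(s_a))
    loss.backward()
    opt.step()

with torch.no_grad():
    # fraction of pairs the fine-tuned metric orders as the annotator did
    acc = (model(d_a) < model(d_b)).float().mean()
```

Only the handful of combination weights is trained, which is what makes this kind of adaptation lightweight enough to personalize from the small amount of ranking feedback a single user can provide.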