Reference-Free Image Quality Assessment for Virtual Try-On via Human Feedback

📅 2026-03-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing virtual try-on systems lack a no-reference, single-image quality assessment method that aligns with human perception. To address this gap, this work introduces VTON-QBench—the largest human-annotated benchmark to date for image quality evaluation, covering 14 state-of-the-art virtual try-on models—and proposes VTON-IQA, a Transformer-based framework with an interleaved cross-attention mechanism that explicitly models the interaction between garment fidelity and the preservation of human details. Experiments demonstrate that VTON-IQA produces image-level quality predictions highly consistent with human judgments under no-reference conditions, establishing the first generalizable, perceptually aligned evaluation standard for virtual try-on models.
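
As context for the gap described above: distribution-level metrics such as Fréchet Inception Distance (discussed in the abstract below) compare feature statistics computed over whole sets of images, so they cannot assign a quality score to a single generated image. Below is a minimal sketch of the standard FID computation, assuming precomputed [N, D] Inception feature arrays; the function name and inputs are illustrative, not from the paper.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Fréchet Inception Distance between two SETS of feature vectors
    (each of shape [N, D], N > 1). Note: it is defined on the mean and
    covariance of a distribution, so it has no meaning for one image."""
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma1 = np.cov(feats_real, rowvar=False)
    sigma2 = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard numerical-noise imaginary parts
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```

Because the score depends on dataset-level means and covariances, a model can look good under FID/KID while individual outputs vary widely in quality, which is the motivation for a per-image, reference-free predictor.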

📝 Abstract
Given a person image and a garment image, image-based Virtual Try-On (VTON) synthesizes a try-on image of the person wearing the target garment. As VTON systems become increasingly important in practical applications such as fashion e-commerce, reliable evaluation of their outputs has emerged as a critical challenge. In real-world scenarios, ground-truth images of the same person wearing the target garment are typically unavailable, making reference-based evaluation impractical. Moreover, widely used distribution-level metrics such as Fréchet Inception Distance and Kernel Inception Distance measure dataset-level similarity and fail to reflect the perceptual quality of individual generated images. To address these limitations, we propose Image Quality Assessment for Virtual Try-On (VTON-IQA), a reference-free framework for human-aligned, image-level quality assessment that requires no ground-truth images. To model human perceptual judgments, we construct VTON-QBench, a large-scale human-annotated benchmark comprising 62,688 try-on images generated by 14 representative VTON models and 431,800 quality annotations collected from 13,838 qualified annotators. To the best of our knowledge, this is the largest dataset to date for human subjective evaluation in virtual try-on. Evaluating virtual try-on quality requires verifying both garment fidelity and the preservation of person-specific details. To explicitly model such interactions, we introduce an Interleaved Cross-Attention module that extends standard transformer blocks by inserting a cross-attention layer between the self-attention and MLP layers in the later blocks. Extensive experiments show that VTON-IQA achieves reliable, human-aligned image-level quality prediction. Moreover, we conduct a comprehensive benchmark evaluation of 14 representative VTON models using VTON-IQA.
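
To make the module structure concrete, here is a minimal PyTorch sketch of one such block, assuming pre-norm residual wiring and illustrative dimensions; the paper's exact configuration (which blocks receive cross-attention, normalization placement, and where the reference tokens come from) may differ.

```python
import torch
import torch.nn as nn

class InterleavedCrossAttentionBlock(nn.Module):
    """Sketch of a transformer block with a cross-attention layer
    inserted between self-attention and the MLP, as the abstract
    describes. Hypothetical wiring, not the authors' released code."""

    def __init__(self, dim: int = 768, num_heads: int = 12, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, tryon_tokens: torch.Tensor, ref_tokens: torch.Tensor):
        # Self-attention over the generated try-on image tokens.
        x = tryon_tokens
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        # Cross-attention: try-on tokens query the reference (garment
        # and/or person) tokens, modeling the interaction between
        # garment fidelity and person-detail preservation.
        h = self.norm2(x)
        x = x + self.cross_attn(h, ref_tokens, ref_tokens, need_weights=False)[0]
        # Position-wise feed-forward network.
        x = x + self.mlp(self.norm3(x))
        return x
```

In this sketch, `tryon_tokens` would be patch tokens of the generated image and `ref_tokens` tokens of the inputs being checked against, so cross-attention lets the quality predictor compare the output to its references while self-attention and the MLP keep image-level context, matching the "interleaved" placement described above.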
Problem

Research questions and friction points this paper is trying to address.

Virtual Try-On
Image Quality Assessment
Reference-Free Evaluation
Human Perception
No Ground Truth
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reference-Free IQA
Virtual Try-On
Human Feedback
Interleaved Cross-Attention
VTON-QBench
Yuki Hirakawa
ZOZO Research, Keio University
Takashi Wada
ZOZO Research
Ryotaro Shimizu
ZOZO Research
Takuya Furusawa
ZOZO Research
Yuki Saito
Lecturer (Sr. Assistant Professor), The University of Tokyo
Speech synthesis, Voice conversion, Machine learning
Ryosuke Araki
ZOZO Inc.
Tianwei Chen
ZOZO Research
Fan Mo
ZOZO Research
Yoshimitsu Aoki
Keio University
Computer vision, Pattern recognition