🤖 AI Summary
Existing drag-based image editing models suffer from unreliable evaluation due to the absence of standardized benchmarks with authentic ground-truth target images and unified metrics.
Method: We introduce RealDrag—the first real-world drag-based editing benchmark—comprising 400+ manually annotated video sequences, each providing source/target images, drag handles, editable region masks, and action descriptions. We propose a novel evaluation paradigm grounded in authentic target images and design four task-specific metrics (SeD, OMPS, IPPS, DiS) jointly quantifying semantic alignment, regional fidelity, and directional consistency. Evaluation is further refined across four dimensions—pixel-level matching, intra-/extra-mask fidelity, and semantic-directional similarity—using high-quality multimodal human annotations.
Contribution/Results: We conduct the first systematic evaluation of 17 state-of-the-art models, uncovering inherent performance trade-offs. We publicly release the dataset, baseline implementations, and an open-source evaluation toolkit to foster reproducible research.
📝 Abstract
The evaluation of drag-based image editing models is unreliable due to the lack of standardized benchmarks and metrics. This ambiguity stems from inconsistent evaluation protocols and, critically, from the absence of datasets containing ground-truth target images, which makes objective comparison between competing methods difficult. To address this, we introduce RealDrag, the first comprehensive benchmark for point-based image editing that includes paired ground-truth target images. Our dataset contains over 400 human-annotated samples drawn from diverse video sources, providing source/target images, handle/target points, editable-region masks, and descriptive captions for both the image and the editing action.
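To make the per-sample annotations concrete, the fields listed above can be sketched as a simple record type. This is an illustrative schema only; the field names and on-disk format are assumptions, not the released dataset layout.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical sample schema: field names are illustrative and do not
# necessarily match the released RealDrag annotation format.
@dataclass
class RealDragSample:
    source_image: str                      # path to the source frame
    target_image: str                      # path to the ground-truth target frame
    handle_points: List[Tuple[int, int]]   # drag start points (x, y)
    target_points: List[Tuple[int, int]]   # corresponding drag end points (x, y)
    editable_mask: str                     # path to the editable-region mask
    image_caption: str                     # description of the image content
    action_caption: str                    # description of the editing action
```

Each handle point is paired with one target point, so the two lists have equal length.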
We also propose four novel, task-specific metrics: Semantical Distance (SeD), Outer Mask Preserving Score (OMPS), Inner Patch Preserving Score (IPPS), and Directional Similarity (DiS). These metrics quantify pixel-level matching fidelity, preservation of non-edited (out-of-mask) regions, and semantic alignment with the intended edit. Using this benchmark, we conduct the first large-scale systematic analysis of the field, evaluating 17 state-of-the-art models. Our results reveal clear trade-offs among current approaches and establish a robust, reproducible baseline to guide future research. Our dataset and evaluation toolkit will be made publicly available.
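The idea behind an out-of-mask preservation metric like OMPS can be sketched as follows. The exact formula is not given here, so this is a minimal stand-in: it scores the agreement between the edited image and the ground-truth target on pixels outside the editable-region mask, normalized so that 1.0 means perfect preservation. The function name and normalization are assumptions for illustration only.

```python
import numpy as np

def outer_mask_preservation(edited: np.ndarray,
                            target: np.ndarray,
                            mask: np.ndarray) -> float:
    """Illustrative out-of-mask fidelity score (not the paper's exact OMPS):
    mean absolute 8-bit pixel error between the edited and ground-truth
    images, restricted to pixels outside the editable mask, mapped to
    [0, 1] where 1.0 means the non-edited region is perfectly preserved."""
    outside = mask == 0                    # pixels the edit must not touch
    if not outside.any():
        return 1.0                         # whole image editable: nothing to check
    err = np.abs(edited.astype(float) - target.astype(float))[outside].mean()
    return 1.0 - err / 255.0               # normalize 8-bit error to [0, 1]
```

A symmetric in-mask variant (restricting to `mask != 0`) would play the role of an inner-patch score such as IPPS.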