🤖 AI Summary
Existing drag-based image editing models suffer from unreliable evaluation due to the absence of standardized benchmarks with authentic ground-truth target images and unified metrics.
Method: We introduce RealDrag—the first real-world drag-based editing benchmark—comprising 400+ manually annotated video sequences, each providing source/target images, drag handles, editable region masks, and action descriptions. We propose a novel evaluation paradigm grounded in authentic target images and design four task-specific metrics (SeD, OMPS, IPPS, DiS) jointly quantifying semantic alignment, regional fidelity, and directional consistency. Evaluation is further refined across four dimensions—pixel-level matching, intra-/extra-mask fidelity, and semantic-directional similarity—using high-quality multimodal human annotations.
Contribution/Results: We conduct the first systematic evaluation of 17 state-of-the-art models, uncovering inherent performance trade-offs. We publicly release the dataset, baseline implementations, and an open-source evaluation toolkit to foster reproducible research.
📝 Abstract
The evaluation of drag-based image editing models is unreliable due to the lack of standardized benchmarks and metrics. This ambiguity stems from inconsistent evaluation protocols and, critically, from the absence of datasets containing ground-truth target images, which makes objective comparison between competing methods difficult. To address this, we introduce RealDrag, the first comprehensive benchmark for point-based image editing that includes paired ground-truth target images. Our dataset contains over 400 human-annotated samples drawn from diverse video sources, providing source/target images, handle/target points, editable-region masks, and descriptive captions for both the image and the editing action.
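To make the per-sample annotations concrete, the fields listed above can be sketched as a simple record type. This is an illustrative schema only; the field names and on-disk format are assumptions, not the released dataset layout.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical sample schema: field names are illustrative and do not
# necessarily match the released RealDrag annotation format.
@dataclass
class RealDragSample:
    source_image: str                      # path to the source frame
    target_image: str                      # path to the ground-truth target frame
    handle_points: List[Tuple[int, int]]   # drag start points (x, y)
    target_points: List[Tuple[int, int]]   # corresponding drag end points (x, y)
    editable_mask: str                     # path to the editable-region mask
    image_caption: str                     # description of the image content
    action_caption: str                    # description of the editing action
```

Each handle point is paired with one target point, so the two lists have equal length.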
We also propose four novel, task-specific metrics: Semantical Distance (SeD), Outer Mask Preserving Score (OMPS), Inner Patch Preserving Score (IPPS), and Directional Similarity (DiS). These metrics quantify pixel-level matching fidelity, preservation of non-edited (out-of-mask) regions, and semantic alignment with the intended edit. Using this benchmark, we conduct the first large-scale systematic analysis of the field, evaluating 17 state-of-the-art models. Our results reveal clear trade-offs among current approaches and establish a robust, reproducible baseline to guide future research. Our dataset and evaluation toolkit will be made publicly available.
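The idea behind an out-of-mask preservation metric like OMPS can be sketched as follows. The exact formula is not given here, so this is a minimal stand-in: it scores the agreement between the edited image and the ground-truth target on pixels outside the editable-region mask, normalized so that 1.0 means perfect preservation. The function name and normalization are assumptions for illustration only.

```python
import numpy as np

def outer_mask_preservation(edited: np.ndarray,
                            target: np.ndarray,
                            mask: np.ndarray) -> float:
    """Illustrative out-of-mask fidelity score (not the paper's exact OMPS):
    mean absolute 8-bit pixel error between the edited and ground-truth
    images, restricted to pixels outside the editable mask, mapped to
    [0, 1] where 1.0 means the non-edited region is perfectly preserved."""
    outside = mask == 0                    # pixels the edit must not touch
    if not outside.any():
        return 1.0                         # whole image editable: nothing to check
    err = np.abs(edited.astype(float) - target.astype(float))[outside].mean()
    return 1.0 - err / 255.0               # normalize 8-bit error to [0, 1]
```

A symmetric in-mask variant (restricting to `mask != 0`) would play the role of an inner-patch score such as IPPS.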