🤖 AI Summary
This work addresses the challenge of quantifying semantic abstraction efficiency in sketch evaluation, a task poorly handled by existing methods that rely on reference images or low-level features. To this end, we propose SEA, a reference-free automatic metric that leverages commonsense knowledge to define key visual elements for each object category and employs a visual question answering model to assess their presence in sketches, thereby measuring semantic retention under visual sparsity. We introduce CommonSketch, the first element-annotated sketch dataset, comprising 23,100 sketches across 300 categories, and establish a novel evaluation framework grounded in commonsense reasoning and element-level semantic preservation. Experimental results demonstrate that SEA correlates strongly with human judgments and effectively discriminates abstraction efficiency across sketches, while CommonSketch offers a new benchmark for fine-grained multimodal understanding.
📝 Abstract
A sketch is a distilled form of visual abstraction that conveys core concepts through simplified yet purposeful strokes while omitting extraneous detail. Despite this expressive power, quantifying the efficiency of semantic abstraction in sketches remains challenging. Existing evaluation methods that rely on reference images, low-level visual features, or recognition accuracy do not capture abstraction, the defining property of sketches. To address these limitations, we introduce SEA (Sketch Evaluation metric for Abstraction efficiency), a reference-free metric that assesses how economically a sketch represents class-defining visual elements while preserving semantic recognizability. These elements are derived per class from commonsense knowledge about features typically depicted in sketches. SEA leverages a visual question answering model to determine the presence of each element and returns a quantitative score that reflects semantic retention under visual economy. To support this metric, we present CommonSketch, the first semantically annotated sketch dataset, comprising 23,100 human-drawn sketches across 300 classes, each paired with a caption and element-level annotations. Experiments show that SEA aligns closely with human judgments and reliably discriminates levels of abstraction efficiency, while CommonSketch serves as a benchmark for the systematic evaluation of element-level sketch understanding across vision-language models.
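The abstract does not give SEA's exact formula, but the element-presence idea can be sketched minimally: query a VQA model once per class-defining element and aggregate the yes/no answers into a score. In the sketch below, `vqa_fn` (a caller-supplied yes/no VQA callable), the question template, and the unweighted average are all assumptions for illustration, not the paper's actual scoring.

```python
from typing import Callable, Sequence


def sea_style_score(
    image: object,
    elements: Sequence[str],
    vqa_fn: Callable[[object, str], bool],
) -> float:
    """Illustrative SEA-style score (assumed form, not the paper's exact metric):
    the fraction of class-defining elements the VQA model judges present."""
    if not elements:
        raise ValueError("need at least one class-defining element")
    # One yes/no VQA query per element; the question template is a placeholder.
    present = sum(vqa_fn(image, f"Does the sketch show {e}?") for e in elements)
    return present / len(elements)


# Usage with a stub VQA model (stands in for a real one, e.g. a BLIP-style VQA):
# the stub "detects" three of the four cat elements, so the score is 3/4.
detected = {"ears", "tail", "paws"}
stub_vqa = lambda img, question: any(e in question for e in detected)
score = sea_style_score(None, ["ears", "whiskers", "tail", "paws"], stub_vqa)
print(score)  # → 0.75
```

A real implementation would plug in an actual VQA model for `vqa_fn` and could weight elements by importance, but the aggregation skeleton stays the same.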