🤖 AI Summary
Existing large vision-language models (LVLMs) struggle to comprehend abstract hand-drawn sketches, which are highly simplified, semantically ambiguous visual representations. To address this, we introduce O3SLM, an open-weight, open-data, open-vocabulary sketch-language model. Our method centers on constructing a large-scale, diverse triplet dataset (image, sketch, natural language instruction), integrating QuickDraw!, Sketchy, TU-Berlin, and our newly curated SketchVCL corpus. We jointly employ contrastive learning and instruction tuning to achieve robust cross-modal alignment among sketches, images, and language. All model weights and training data are publicly released, and the open-vocabulary design enables zero-shot generalization. O3SLM achieves state-of-the-art performance across multiple sketch-driven tasks, including object localization, counting, image retrieval, and visual question answering, outperforming prior LVLMs by significant margins. This work substantially advances understanding of and reasoning over abstract visual representations.
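The contrastive half of the alignment objective can be sketched as a symmetric InfoNCE loss applied to each pairing of the three modalities. This is a minimal NumPy illustration of that general technique, not the paper's actual implementation; the function names, temperature value, and the choice to average the three pairwise losses are assumptions.

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of embeddings.

    a, b: (B, D) arrays where row i of each comes from the same triplet.
    """
    # L2-normalize rows so dot products are cosine similarities
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature

    def ce(l):
        # numerically stable log-softmax cross-entropy with the
        # diagonal (matching pairs) as the target class
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # average both matching directions (a -> b and b -> a)
    return (ce(logits) + ce(logits.T)) / 2

def triplet_contrastive_loss(sketch_emb, image_emb, text_emb):
    """Average InfoNCE over the sketch-image, sketch-text, image-text pairs."""
    return (info_nce(sketch_emb, image_emb)
            + info_nce(sketch_emb, text_emb)
            + info_nce(image_emb, text_emb)) / 3
```

Aligned triplets (matching rows close in embedding space) drive the loss toward zero, while mismatched rows are penalized, which is what pulls the three modalities into a shared space before instruction tuning.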
📝 Abstract
While Large Vision-Language Models (LVLMs) are increasingly deployed in real-world applications, their ability to interpret abstract visual inputs remains limited. Specifically, they struggle to comprehend hand-drawn sketches, a modality that offers an intuitive means of expressing concepts that are difficult to describe textually. We identify the primary bottleneck as the absence of a large-scale dataset that jointly models sketches, photorealistic images, and corresponding natural language instructions. To address this, we present two key contributions: (1) a new, large-scale dataset of image-sketch-instruction triplets designed to facilitate both pretraining and instruction tuning, and (2) O3SLM, an LVLM trained on this dataset. We comprehensively evaluate on multiple sketch-based tasks: (a) object localization, (b) counting, (c) image retrieval, i.e., sketch-based image retrieval (SBIR) and fine-grained SBIR, and (d) visual question answering (VQA), using three existing sketch datasets, namely QuickDraw!, Sketchy, and TU-Berlin, along with our generated SketchVCL dataset. These evaluations show that O3SLM achieves state-of-the-art performance, substantially outperforming existing LVLMs in sketch comprehension and reasoning.