O3SLM: Open Weight, Open Data, and Open Vocabulary Sketch-Language Model

📅 2025-11-18
🤖 AI Summary
Existing large vision-language models (LVLMs) struggle to comprehend abstract hand-drawn sketches, which are highly simplified and semantically ambiguous visual representations. To address this, we introduce O3SLM, an open-weight, open-vocabulary sketch-language model. Our method centers on constructing a large-scale, diverse dataset of image, sketch, and natural-language-instruction triplets, integrating QuickDraw!, Sketchy, and TU-Berlin with our newly curated SketchVCL corpus. We jointly employ contrastive learning and instruction tuning to achieve robust cross-modal alignment between sketches, images, and language. Model weights and training data are publicly released, and the open-vocabulary design enables zero-shot generalization. O3SLM achieves state-of-the-art performance across multiple sketch-driven tasks, including object localization, counting, image retrieval, and visual question answering, outperforming prior LVLMs by significant margins. This work substantially advances understanding and reasoning over abstract visual representations.
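The cross-modal alignment described in the summary can be illustrated with a minimal sketch of a symmetric InfoNCE-style contrastive loss over paired sketch and image embeddings. The function name, embedding dimension, and temperature below are illustrative assumptions, not the paper's actual architecture or hyperparameters.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(sketch_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired sketch/image embeddings.

    sketch_emb, image_emb: (batch, dim) tensors from separate encoders
    (hypothetical stand-ins for the paper's sketch and image encoders).
    """
    # L2-normalize so the dot product becomes a cosine similarity.
    sketch_emb = F.normalize(sketch_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)

    # Pairwise similarity matrix; diagonal entries are the true pairs.
    logits = sketch_emb @ image_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: sketch-to-image and image-to-sketch.
    loss_s2i = F.cross_entropy(logits, targets)
    loss_i2s = F.cross_entropy(logits.t(), targets)
    return (loss_s2i + loss_i2s) / 2

# Toy usage with random embeddings standing in for encoder outputs.
sketches = torch.randn(8, 512)
images = torch.randn(8, 512)
print(contrastive_alignment_loss(sketches, images).item())
```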

📝 Abstract
While Large Vision Language Models (LVLMs) are increasingly deployed in real-world applications, their ability to interpret abstract visual inputs remains limited. Specifically, they struggle to comprehend hand-drawn sketches, a modality that offers an intuitive means of expressing concepts that are difficult to describe textually. We identify the primary bottleneck as the absence of a large-scale dataset that jointly models sketches, photorealistic images, and corresponding natural language instructions. To address this, we present two key contributions: (1) a new, large-scale dataset of image-sketch-instruction triplets designed to facilitate both pretraining and instruction tuning, and (2) O3SLM, an LVLM trained on this dataset. Comprehensive evaluations on multiple sketch-based tasks, (a) object localization, (b) counting, (c) image retrieval (i.e., SBIR and fine-grained SBIR), and (d) visual question answering (VQA), conducted on the three existing sketch datasets QuickDraw!, Sketchy, and TU-Berlin along with our generated SketchVCL dataset, show that O3SLM achieves state-of-the-art performance, substantially outperforming existing LVLMs in sketch comprehension and reasoning.
Problem

Research questions and friction points this paper is trying to address.

LVLMs struggle to interpret abstract hand-drawn sketches effectively
Lack of large-scale datasets combining sketches, images, and language instructions
Need for improved sketch comprehension across localization, counting, retrieval, and VQA
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale dataset of image-sketch-instruction triplets (an illustrative record is sketched after this list)
O3SLM, an open-weight sketch-language model trained on this dataset
State-of-the-art performance on sketch-based tasks
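As referenced in the first innovation bullet, a single image-sketch-instruction triplet might look like the record below; the field names and values are purely illustrative assumptions, since the released dataset's actual schema is not shown here.

```python
# Hypothetical triplet record; field names and values are illustrative
# assumptions, not the actual SketchVCL schema.
triplet = {
    "image_path": "photos/park_scene_001.jpg",  # photorealistic image
    "sketch_path": "sketches/dog_001.png",      # abstract hand-drawn sketch
    "instruction": "Count how many instances of the sketched object appear in the image.",
    "answer": "2",
}

print(triplet["instruction"])
```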
Rishi Gupta
Indian Institute of Science, Bangalore

Mukilan Karuppasamy
Indian Institute of Science, Bangalore

Shyam Marjit
CDS, Indian Institute of Science
Computer Vision, VLMs, LLMs

Aditay Tripathi
Indian Institute of Science
Computer Vision, NLP, KG embeddings

Anirban Chakraborty
Indian Institute of Science, Bangalore