🤖 AI Summary
This work addresses the limited ability of multimodal large language models (MLLMs) to understand fine-grained visual prompts. To this end, we propose a visual-prompt-driven interactive paradigm. Methodologically, we design SPHINX-V, a fully end-to-end trainable MLLM architecture that integrates a vision encoder, a visual prompt encoder, and a large language model, and that processes pixel-level visual instructions including points, bounding boxes, and free-form sketches. We further introduce MDVP-Data, the first large-scale, multi-domain visual prompt dataset (1.6M samples), and MDVP-Bench, a challenging benchmark for fine-grained visual prompting. Extensive experiments demonstrate substantial improvements over state-of-the-art MLLMs on pixel-level captioning and visual question answering across diverse inputs, including natural images, documents, OCR scenes, mobile/web screenshots, and multi-panel figures. To our knowledge, this is the first approach to achieve general-purpose, trainable, fine-grained visual prompt understanding.
📝 Abstract
The interaction between humans and artificial intelligence (AI) is a crucial factor that reflects the effectiveness of multimodal large language models (MLLMs). However, current MLLMs primarily focus on image-level comprehension and limit interaction to textual instructions, thereby constraining their flexibility of use and depth of response. In this paper, we introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting. Specifically, we propose SPHINX-V, a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder, and an LLM for understanding various visual prompts (points, bounding boxes, and free-form shapes) alongside language. To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench. MDVP-Data is a multi-domain dataset containing 1.6M unique image–visual prompt–text instruction-following samples, spanning natural images, document images, OCR images, mobile screenshots, web screenshots, and multi-panel images. Furthermore, we present MDVP-Bench, a comprehensive and challenging benchmark to assess a model's capability in understanding visual prompting instructions. Our experiments demonstrate SPHINX-V's impressive multimodal interaction capabilities through visual prompting, revealing significant improvements in detailed pixel-level description and question answering.
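To make the architecture concrete, here is a minimal NumPy sketch of the kind of fusion the abstract describes: a small visual prompt encoder projects prompt coordinates (e.g. normalized bounding boxes) into the LLM's embedding space, and the resulting prompt tokens are concatenated with the image tokens before being fed to the language model. All names, dimensions, and the two-layer MLP are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def encode_visual_prompts(coords, w1, w2):
    """Illustrative prompt encoder: project normalized prompt coordinates
    (N, 4) into the LLM embedding space via a tiny two-layer MLP."""
    h = np.maximum(coords @ w1, 0.0)  # ReLU hidden layer
    return h @ w2

rng = np.random.default_rng(0)
llm_dim = 64  # assumed LLM embedding width for this sketch

# Hypothetical random weights standing in for learned parameters.
w1 = rng.standard_normal((4, 32))
w2 = rng.standard_normal((32, llm_dim))

image_tokens = rng.standard_normal((196, llm_dim))  # stand-in for vision-encoder output
boxes = rng.random((3, 4))                          # three boxes, normalized xyxy

prompt_tokens = encode_visual_prompts(boxes, w1, w2)
# Fuse: the LLM would attend over image tokens and prompt tokens jointly.
llm_input = np.concatenate([image_tokens, prompt_tokens], axis=0)
print(llm_input.shape)  # (199, 64)
```

Points and free-form sketches could be handled the same way by flattening or sampling their coordinates before projection; the key idea is that every prompt type ends up as a few extra tokens in the LLM's input sequence.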