ImageNet-Think-250K: A Large-Scale Synthetic Dataset for Multimodal Reasoning for Vision Language Models

📅 2025-10-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vision-language models (VLMs) lack explicit, interpretable multimodal reasoning capabilities. Method: The authors introduce ImageNet-Think-250K, a large-scale, structured multimodal reasoning dataset built from 250K ImageNet21k images. Two VLMs, GLM-4.1V-9B-Thinking and Kimi-VL-A3B-Thinking-2506, each generate a chain-of-thought (CoT) trace and a final descriptive answer for every image, so each image carries two thinking-answer pairs. Contribution/Results: The dataset supports training and evaluating VLMs with explicit, traceable reasoning, and it is accompanied by evaluation benchmarks that the authors plan to release publicly. The stated goal is to enable more robust VLMs and to contribute to a broader understanding of multimodal reasoning mechanisms.

📝 Abstract

We develop ImageNet-Think, a multimodal reasoning dataset designed to aid the development of Vision Language Models (VLMs) with explicit reasoning capabilities. Our dataset is built on 250,000 images from the ImageNet21k dataset, providing structured thinking tokens and corresponding answers. Our synthetic dataset is generated by two state-of-the-art VLMs: GLM-4.1V-9B-Thinking and Kimi-VL-A3B-Thinking-2506. Each image is accompanied by two pairs of thinking-answer sequences, creating a resource for training and evaluating multimodal reasoning models. We capture the step-by-step reasoning process of VLMs and the final descriptive answers. Our goal with this dataset is to enable the development of more robust VLMs while contributing to the broader understanding of multimodal reasoning mechanisms. The dataset and evaluation benchmarks will be publicly available to aid research on reasoning (thinking) VLMs.
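The abstract says each image comes with two thinking-answer pairs, one per generator model. A minimal sketch of how such a record might be represented is shown here. The class and field names (`ReasoningTrace`, `ImageRecord`, `image_id`, `traces`) and the example identifier are hypothetical, chosen for illustration; the released dataset may use a different schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ReasoningTrace:
    """One thinking-answer pair produced by a single generator model."""
    model: str      # e.g. "GLM-4.1V-9B-Thinking"
    thinking: str   # step-by-step reasoning text
    answer: str     # final descriptive answer


@dataclass(frozen=True)
class ImageRecord:
    """Hypothetical record layout: one image with exactly two traces."""
    image_id: str
    traces: Tuple[ReasoningTrace, ...]

    def __post_init__(self) -> None:
        # The abstract describes two thinking-answer pairs per image.
        if len(self.traces) != 2:
            raise ValueError("expected exactly two thinking-answer pairs")

    def to_training_pairs(self) -> List[Tuple[str, str, str]]:
        """Flatten into (image_id, thinking, answer) tuples, one per trace."""
        return [(self.image_id, t.thinking, t.answer) for t in self.traces]


record = ImageRecord(
    image_id="example_0001",  # placeholder identifier, not a real entry
    traces=(
        ReasoningTrace("GLM-4.1V-9B-Thinking", "I see fur and a tail.", "A dog."),
        ReasoningTrace("Kimi-VL-A3B-Thinking-2506", "Four legs, a snout.", "A dog on grass."),
    ),
)
pairs = record.to_training_pairs()
```

Keeping both traces attached to the same image, rather than storing them as unrelated rows, makes it straightforward to compare the two models' reasoning on identical inputs.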

Problem

Research questions and friction points this paper is trying to address.

Develops a multimodal reasoning dataset for Vision Language Models

Provides structured thinking tokens and descriptive answers

Enables training and evaluation of robust reasoning VLMs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic dataset with structured thinking tokens

Generated by two advanced Vision Language Models

Includes step-by-step reasoning and final answers
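Separating step-by-step reasoning from the final answer usually means splitting each raw model generation into two parts. The sketch below shows one way to do that, assuming the reasoning is wrapped in `<think>` and `</think>` delimiters. These delimiters and the function name `split_generation` are assumptions for illustration; the source does not state the delimiter format, and each generator model may use its own markers.

```python
import re
from typing import Tuple


def split_generation(
    raw: str,
    open_tag: str = "<think>",    # assumed delimiter, not confirmed by the source
    close_tag: str = "</think>",  # assumed delimiter, not confirmed by the source
) -> Tuple[str, str]:
    """Split a raw generation into (thinking, answer).

    If no delimited reasoning span is found, the thinking part is empty
    and the whole text is treated as the answer.
    """
    pattern = re.compile(
        re.escape(open_tag) + r"(.*?)" + re.escape(close_tag),
        re.DOTALL,  # reasoning may span multiple lines
    )
    match = pattern.search(raw)
    if match is None:
        return "", raw.strip()
    thinking = match.group(1).strip()
    answer = raw[match.end():].strip()  # everything after the closing tag
    return thinking, answer


thinking, answer = split_generation(
    "<think>The animal has fur\nand a wagging tail.</think> A golden retriever."
)
```

Passing the delimiters as parameters keeps the same function usable for models that mark their reasoning with different tokens.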
