From Macro to Micro: Benchmarking Microscopic Spatial Intelligence on Molecules via Vision-Language Models

📅 2025-12-11
🤖 AI Summary
This work addresses the limited capability of vision-language models (VLMs) to perceive and reason about spatial relationships among invisible microscopic entities, particularly molecules, despite the growing role of VLMs in scientific AI. Method: the authors introduce Microscopic Spatial Intelligence (MiSI) as a new paradigm and propose MiSI-Bench, the first molecular-scale vision-language benchmark, comprising 163K question-answer pairs and 587K multi-view rendered images across nine scientific spatial-reasoning tasks. The pipeline renders physically plausible images from 3D molecular structures, annotates spatial relations, encodes scientific constraints, and fine-tunes VLMs. Contribution/Results: experiments show that state-of-the-art VLMs substantially underperform humans overall. A fine-tuned 7B VLM surpasses human accuracy on spatial-transformation tasks but lags significantly on knowledge-intensive tasks such as hydrogen-bond identification, empirically validating the necessity and efficacy of domain-knowledge enhancement for microscopic spatial intelligence.

📝 Abstract
This paper introduces the concept of Microscopic Spatial Intelligence (MiSI), the capability to perceive and reason about the spatial relationships of invisible microscopic entities, which is fundamental to scientific discovery. To assess the potential of Vision-Language Models (VLMs) in this domain, we propose MiSI-Bench, a systematic benchmark framework. This framework features over 163,000 question-answer pairs and 587,000 images derived from approximately 4,000 molecular structures, covering nine complementary tasks that evaluate abilities ranging from elementary spatial transformations to complex relational identifications. Experimental results reveal that current state-of-the-art VLMs perform significantly below human level on this benchmark. However, a fine-tuned 7B model demonstrates substantial potential, even surpassing humans in spatial transformation tasks, while its poor performance in scientifically grounded tasks like hydrogen bond recognition underscores the necessity of integrating explicit domain knowledge for progress toward scientific AGI. The datasets are available at https://huggingface.co/datasets/zongzhao/MiSI-bench.
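To make the benchmark's shape concrete, here is a minimal sketch of how one evaluation record combining multi-view renderings with a question-answer pair might be organized. All field names and values are hypothetical illustrations, not the released schema; consult the Hugging Face dataset card for the actual format.

```python
# Hypothetical sketch of one MiSI-Bench-style record: a multiple-choice
# question about a molecule, paired with several rendered views.
# Field names and values are illustrative assumptions, not the real schema.
from dataclasses import dataclass


@dataclass
class MiSIRecord:
    molecule_id: str   # identifier of the source 3D molecular structure
    task: str          # one of the nine spatial-reasoning tasks
    views: list        # paths to multi-view rendered images
    question: str
    options: list
    answer: str        # correct option label


record = MiSIRecord(
    molecule_id="mol_0001",
    task="hydrogen_bond_identification",
    views=["mol_0001_view0.png", "mol_0001_view1.png", "mol_0001_view2.png"],
    question="Which atom pair forms a hydrogen bond in these views?",
    options=["A) O1-H3", "B) C2-H5", "C) N4-C7", "D) none"],
    answer="A",
)


def is_correct(rec: MiSIRecord, predicted: str) -> bool:
    """Score a model's predicted option label against the gold answer."""
    return predicted.strip().upper() == rec.answer


print(is_correct(record, "a"))  # True
```

Accuracy over the benchmark would then be the mean of `is_correct` across records, optionally broken down per task to reproduce the per-task comparisons described above.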
Problem

Research questions and friction points this paper is trying to address.

Benchmarking VLMs on microscopic spatial intelligence tasks
Assessing VLMs' ability to perceive molecular spatial relationships
Evaluating VLMs on complex scientific reasoning like hydrogen bonds
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark framework for microscopic spatial intelligence evaluation
Fine-tuned 7B model excels in spatial transformation tasks
Integration of domain knowledge needed for scientific tasks
Zongzhao Li
Gaoling School of Artificial Intelligence, Renmin University of China

Xiangzhe Kong
Tsinghua University
NLP · GNN · AIDD · AI4Science

Jiahui Su
SKL-ESPC & SEPKL-AERM, College of Environmental Sciences and Engineering, Peking University

Zongyang Ma
MAIS, Institute of Automation, Chinese Academy of Sciences

Mingze Li
Gaoling School of Artificial Intelligence, Renmin University of China

Songyou Li
Gaoling School of Artificial Intelligence, Renmin University of China
AI for Science

Yuelin Zhang
Gaoling School of Artificial Intelligence, Renmin University of China

Yu Rong
DAMO Academy, Alibaba Group, Hangzhou, China

Tingyang Xu
Alibaba DAMO Academy
Machine Learning · Deep Graph Learning · Drug Discovery

Deli Zhao
Alibaba DAMO Academy
Generative Models · Multimodal Learning · Foundation Models

Wenbing Huang
Associate Professor, Renmin University of China
Machine Learning · AI for Science