PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents

📅 2024-06-20
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
Current large multimodal models (LMMs) exhibit perceptual biases and logical errors in complex visual understanding and cross-modal reasoning. To address this, the paper proposes the PIN format, a knowledge-intensive, structured representation that interleaves text and aligned images, pairing Markdown files with overall document images. Using this format, the authors construct PIN-14M, an open-source multimodal document dataset of 14 million samples drawn from Chinese and English scientific and web sources. The dataset is built with attention to data quality and ethical integrity, and it is designed to support diverse training strategies and to improve robustness against common multimodal training pitfalls. Initial results suggest significant potential for the PIN format in refining LMM performance, with larger-scale releases and more detailed evaluations planned.
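As a rough illustration of the kind of interleaved text-and-image document the PIN format describes, the sketch below renders alternating text and image segments into a single Markdown string. The segment structure and field names are assumptions made for illustration, not the released PIN-14M schema.

```python
# Toy illustration of an interleaved document rendered as Markdown.
# NOTE: the segment structure and field names are assumed for illustration;
# they are not the exact schema released with PIN-14M.
segments = [
    {"type": "text", "content": "## Scaled dot-product attention\nSelf-attention relates every token to every other token."},
    {"type": "image", "path": "images/attention_diagram.png", "caption": "Attention diagram"},
    {"type": "text", "content": "The figure above shows the query-key-value computation."},
]

def to_markdown(segments):
    """Render interleaved segments as one Markdown document,
    keeping each image aligned with its surrounding text."""
    parts = []
    for seg in segments:
        if seg["type"] == "text":
            parts.append(seg["content"])
        else:
            parts.append(f"![{seg['caption']}]({seg['path']})")
    return "\n\n".join(parts)

print(to_markdown(segments))
```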

📝 Abstract
Recent advancements in Large Multimodal Models (LMMs) have leveraged extensive multimodal datasets to enhance capabilities in complex knowledge-driven tasks. However, persistent challenges in perceptual and reasoning errors limit their efficacy, particularly in interpreting intricate visual data and deducing multimodal relationships. Addressing these issues, we introduce a novel dataset format, PIN (Paired and INterleaved multimodal documents), designed to significantly improve both the depth and breadth of multimodal training. The PIN format is built on three foundational principles: knowledge intensity, scalability, and support for diverse training modalities. This innovative format combines markdown files and comprehensive images to enrich training data with a dense knowledge structure and versatile training strategies. We present PIN-14M, an open-source dataset comprising 14 million samples derived from a diverse range of Chinese and English sources, tailored to include complex web and scientific content. This dataset is constructed meticulously to ensure data quality and ethical integrity, aiming to facilitate advanced training strategies and improve model robustness against common multimodal training pitfalls. Our initial results, forming the basis of this technical report, suggest significant potential for the PIN format in refining LMM performance, with plans for future expansions and detailed evaluations of its impact on model capabilities.
Problem

Research questions and friction points this paper is trying to address.

Addresses perceptual and reasoning errors in multimodal models
Enhances interpretation of complex visual and textual data
Improves multimodal relationship deduction in knowledge tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines Markdown files with document images
Introduces scalable datasets PIN-200M and PIN-14M
Provides quality signals for task-specific filtering (illustrated in the sketch below)
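A minimal sketch of how such a sample might be loaded and filtered on a quality signal follows. All field names (markdown, overall_image, images, quality_signals, overall_score) and the shard file name are hypothetical placeholders, not the dataset's documented schema.

```python
# Hypothetical sketch: load PIN-style samples (Markdown text paired with images)
# from a JSONL shard and keep only those passing a simple quality threshold.
# Field names and the shard file name are assumptions, not the released schema.
import json
from pathlib import Path

def load_pin_samples(jsonl_path, min_quality=0.5):
    """Yield samples whose (hypothetical) overall quality score clears min_quality."""
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            sample = json.loads(line)
            signals = sample.get("quality_signals", {})
            if signals.get("overall_score", 1.0) < min_quality:
                continue  # task-specific filtering on quality signals
            yield {
                "markdown": sample["markdown"],                    # interleaved Markdown text
                "overall_image": Path(sample["overall_image"]),    # rendered whole-document image
                "content_images": [Path(p) for p in sample.get("images", [])],
            }

if __name__ == "__main__":
    for doc in load_pin_samples("pin_14m_shard_000.jsonl"):
        print(doc["overall_image"], len(doc["content_images"]))
        break
```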
👥 Authors
Junjie Wang (Multimodal Art Projection)
Yin Zhang (Multimodal Art Projection)
Yatai Ji (Tsinghua University)
Yuxiang Zhang (Multimodal Art Projection, Waseda University)
Chunyang Jiang (HKUST)
Yubo Wang (University of Waterloo)
Kang Zhu (01.AI)
Zekun Wang (01.AI)
Tiezhen Wang (Hugging Face)
Wenhao Huang (01.AI)
Jie Fu (Independent Researcher)
Bei Chen (01.AI)
Qunshu Lin (Co-Founder, Abaka.AI)
Minghao Liu (22077AI)
Ge Zhang (Multimodal Art Projection, University of Waterloo, 01.AI)
Wenhu Chen (Assistant Professor, University of Waterloo)