Granite Vision: a lightweight, open-source multimodal model for enterprise Intelligence

📅 2025-02-14
🤖 AI Summary
This work addresses enterprise-level visual document understanding (VDU): information extraction and reasoning across heterogeneous visual document formats, including tables, charts, diagrams, infographics, and sketches. The model aligns a lightweight vision encoder with a decoder-only, 2B-parameter Granite large language model, is instruction-tuned on document-specialized data, and adds a test-time safety classification mechanism based on a sparse set of attention vectors, keeping the full model under 3B parameters. Evaluated on standard VDU benchmarks and on the contamination-resistant LiveXiv benchmark, the model achieves strong results despite its small size. Model weights, training data, and implementation details are publicly released under the Apache-2.0 license.

📝 Abstract
We introduce Granite Vision, a lightweight large language model with vision capabilities, specifically designed to excel in enterprise use cases, particularly in visual document understanding. Our model is trained on a comprehensive instruction-following dataset, including document-related tasks, such as content extraction from tables, charts, diagrams, sketches, and infographics, as well as general image tasks. The architecture of Granite Vision is centered around visual modality alignment with a decoder-only, 2 billion parameter Granite large language model. Additionally, we introduce a dedicated safety classification approach at test time that leverages a sparse set of attention vectors to identify potentially harmful inputs. Despite its lightweight architecture, Granite Vision achieves strong results on standard benchmarks related to visual document understanding, as well as on the LiveXiv benchmark, which is designed to avoid test set contamination by using a constantly updated corpus of recently published arXiv papers. We are releasing the model under the Apache-2.0 license, allowing for both research and commercial use, while offering complete visibility into the training data and other relevant details. See https://huggingface.co/ibm-granite/ for model weights.
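The abstract describes the safety classifier only at a high level, so here is a minimal sketch of the general idea of classifying inputs from a sparse subset of attention-head activations. Everything below is a synthetic, hypothetical illustration: the head count, feature dimensions, the `informative` heads, and the nearest-centroid decision rule are assumptions for this sketch, not the authors' actual method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: one pooled activation vector per attention head, per input.
n_inputs, n_heads, dim = 200, 32, 16

# Synthetic data: "harmful" inputs (label 1) shift a few heads' activations.
labels = rng.integers(0, 2, n_inputs)
acts = rng.normal(size=(n_inputs, n_heads, dim))
informative = [3, 7, 21]  # heads that carry the signal (assumed for this demo)
for h in informative:
    acts[labels == 1, h, :] += 1.5

# Per-head class centroids and a separability score per head.
mu_harm = acts[labels == 1].mean(axis=0)  # shape (n_heads, dim)
mu_safe = acts[labels == 0].mean(axis=0)
separability = np.linalg.norm(mu_harm - mu_safe, axis=1)

# Keep only a sparse subset of heads: the k most class-discriminative ones.
k = 3
selected = np.argsort(separability)[-k:]

def is_harmful(x: np.ndarray) -> bool:
    """Nearest-centroid decision using only the selected (sparse) heads."""
    v = x[selected].ravel()
    d_harm = np.linalg.norm(v - mu_harm[selected].ravel())
    d_safe = np.linalg.norm(v - mu_safe[selected].ravel())
    return d_harm < d_safe
```

On this synthetic data, the selection step recovers the planted informative heads and the centroid rule separates the two classes; a real system would instead extract activations from the vision-language model and calibrate a proper classifier on labeled safety data.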
Problem

Research questions and friction points this paper is trying to address.

Develops a lightweight multimodal model
Enhances enterprise visual document understanding
Ensures the safety of model inputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight multimodal model
Visual document understanding
Test-time safety classification approach
Granite Vision Team

Leonid Karlinsky (IBM Research)
Assaf Arbelle (IBM Research)
Abraham Daniels (IBM Research)
Ahmed Nassar (IBM Research)
Amit Alfassi (IBM Research)
Bo Wu (IBM Research)
Eli Schwartz (IBM Research)
Dhiraj Joshi (IBM T. J. Watson Research)
Jovana Kondic (IBM Research)
Nimrod Shabtay (IBM Research)
Pengyuan Li (IBM Research)
Roei Herzig (MIT-IBM Lab | BAIR, UC Berkeley)
Shafiq Abedin (IBM Research)
Shaked Perek (IBM Research)
Sivan Harary (IBM Research)
Udi Barzelay (IBM Research)
Adi Raz Goldfarb (IBM Research)
Aude Oliva (CSAIL, MIT; MIT-IBM Lab)
Ben Wieles (IBM Research)
Bishwaranjan Bhattacharjee (IBM Research)
Brandon Huang (IBM Research)
Christoph Auer (IBM Research)
Dan Gutfreund (IBM Research)
D. Beymer (IBM Research)
David Wood (IBM Research)
Hildegard Kuehne (IBM Research)
Jacob Hansen (IBM Research)
J. Shtok (IBM Research)
Ken Wong (IBM Research)
Luis Angel D. Bathen (IBM Research)
Mayank Mishra (IBM Research)
Maksym Lysak (IBM)
Michele Dolfi (IBM Research)
Mikhail Yurochkin (IFM, MBZUAI)
Nikolaos Livathinos (IBM Research)
Nimrod Harel (Tel Aviv University)
Ophir Azulai (IBM Research)
O. Naparstek (IBM Research)
Rafael Teixeira de Lima (IBM Research)
Rameswar Panda (IBM Research)
Sivan Doveh (Weizmann Institute of Science; Google)
Shubham Gupta (IBM Research)
Subhro Das (Microsoft)
Syed Zawad (IBM)
Yusik Kim (IBM Research)
Zexue He (University of California, San Diego)
Alexander Brooks (IBM Research)
Gabe Goodhart (IBM Research)
A. Govindjee (IBM Research)
Derek Leist (IBM Research)
Ibrahim Ibrahim (IBM Research)
A. Soffer (IBM Research)
David Cox (IBM Research; MIT-IBM Watson AI Lab)
Kate Soule (IBM Research)
Luis A. Lastras (IBM Research)
Nirmit Desai (IBM Research)
Shila Ofek-Koifman (IBM Research)
Sriram Raghavan (IBM Research)
T. Syeda-Mahmood (IBM Research)
Peter W. J. Staar (IBM Research)
Tal Drory (IBM Research)
Rogério Feris (IBM Research)