PLaMo 2.1-VL Technical Report

Date: 2026-04-21

AI Summary
This work targets the limited visual question answering (VQA) and visual grounding capabilities of lightweight vision-language models (VLMs) in Japanese-language settings, with local and edge deployment in mind. It introduces PLaMo 2.1-VL, available in 8B and 2B parameter variants and designed to run on local and edge devices. The work combines the lightweight VLM with a large-scale synthetic data generation pipeline and Japanese training and evaluation resources, and it is evaluated both zero-shot and after domain-specific fine-tuning. The model outperforms comparable open models on Japanese and English benchmarks, reaching 61.5 ROUGE-L on JA-VG-VQA-500 and 85.2% accuracy on Japanese Ref-L4. In the two application scenarios, it attains 53.9% zero-shot accuracy on factory task analysis via tool recognition, and fine-tuning on power plant data raises the anomaly detection bbox + label F1-score from 39.7 to 64.9.


Abstract
We introduce PLaMo 2.1-VL, a lightweight Vision Language Model (VLM) for autonomous devices, available in 8B and 2B variants and designed for local and edge deployment with Japanese-language operation. Focusing on Visual Question Answering (VQA) and Visual Grounding as its core capabilities, we develop and evaluate the models for two real-world application scenarios: factory task analysis via tool recognition, and infrastructure anomaly detection. We also develop a large-scale synthetic data generation pipeline and comprehensive Japanese training and evaluation resources. PLaMo 2.1-VL outperforms comparable open models on Japanese and English benchmarks, achieving 61.5 ROUGE-L on JA-VG-VQA-500 and 85.2% accuracy on Japanese Ref-L4. For the two application scenarios, it achieves 53.9% zero-shot accuracy on factory task analysis, and fine-tuning on power plant data improves anomaly detection bbox + label F1-score from 39.7 to 64.9.
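The "bbox + label F1-score" quoted above counts a predicted anomaly as correct only when both its bounding box and its class label agree with a ground-truth annotation. The abstract does not state the matching protocol, so the following is only a minimal sketch under assumptions: an IoU threshold of 0.5, greedy one-to-one matching, and the hypothetical labels "crack" and "rust" are illustrative choices, not the report's evaluation code.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Detection = Tuple[Box, str]              # (box, class label)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def bbox_label_f1(
    preds: List[Detection],
    golds: List[Detection],
    iou_threshold: float = 0.5,  # assumed threshold, not taken from the report
) -> float:
    """F1 where a prediction is a true positive only if its label matches
    an unmatched ground-truth box and their IoU reaches the threshold."""
    matched = set()  # indices of ground-truth boxes already claimed
    tp = 0
    for pbox, plabel in preds:
        best_j, best_iou = -1, 0.0
        for j, (gbox, glabel) in enumerate(golds):
            if j in matched or glabel != plabel:
                continue
            overlap = iou(pbox, gbox)
            if overlap > best_iou:
                best_j, best_iou = j, overlap
        if best_j >= 0 and best_iou >= iou_threshold:
            matched.add(best_j)
            tp += 1
    fp = len(preds) - tp
    fn = len(golds) - tp
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical example: one correct detection, one wrong label, one stray box.
golds = [((0, 0, 10, 10), "crack"), ((20, 20, 30, 30), "rust")]
preds = [
    ((1, 1, 10, 10), "crack"),    # overlaps the first box with the right label
    ((20, 20, 30, 30), "crack"),  # perfect box, wrong label
    ((50, 50, 60, 60), "rust"),   # right label, no overlapping box
]
score = bbox_label_f1(preds, golds)
```

Under this definition a perfectly localized box with the wrong label still counts as both a false positive and a false negative, which is why a joint bbox + label score can be much lower than localization or classification accuracy measured separately.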
Problem

Research questions and friction points this paper is trying to address.

Vision Language Model
Visual Question Answering
Visual Grounding
Edge Deployment
Japanese-language Operation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision Language Model
Lightweight VLM
Japanese-language VQA
Synthetic Data Generation
Edge Deployment