TechING: Towards Real World Technical Image Understanding via VLMs

📅 2026-01-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language models (VLMs) struggle to interpret hand-drawn technical diagrams—such as flowcharts and block diagrams—in real-world settings, limiting their applicability in specialized domains. To address this challenge, this work proposes LLaMA-VL-TUG, a novel approach that introduces a synthetic data generation strategy tailored specifically for technical imagery and a new self-supervised pretraining objective. By fine-tuning Llama 3.2 11B-Instruct with this framework, the model achieves a 2.14× improvement in ROUGE-L on synthetic data without requiring large-scale real-world hand-drawn datasets. On authentic hand-drawn diagram benchmarks, it yields the fewest compilation errors across seven out of eight diagram types and demonstrates a 6.97× average increase in F1 score, significantly advancing technical diagram understanding and generation capabilities.

📝 Abstract
Professionals working in technical domains typically hand-draw technical diagrams (e.g., flowcharts and block diagrams) on whiteboards or paper during discussions; however, editing these diagrams later requires redrawing them from scratch. Modern VLMs have made tremendous progress in image understanding, but they struggle to understand technical diagrams. One way to overcome this problem is to fine-tune on real-world hand-drawn images, but it is not practically feasible to produce a large number of such images. In this paper, we introduce a large synthetically generated corpus (reflective of real-world images) for training VLMs, and we subsequently evaluate VLMs on a smaller corpus of hand-drawn images with the help of human evaluators. We introduce several new self-supervision tasks for training, perform extensive experiments with various baseline models, and fine-tune the Llama 3.2 11B-Instruct model on synthetic images on these tasks to obtain LLama-VL-TUG, which improves the ROUGE-L performance of Llama 3.2 11B-Instruct by 2.14x and achieves the best all-round performance across all baseline models. On real-world images, human evaluation reveals that we achieve the fewest compilation errors among all baselines in 7 out of 8 diagram types and improve the average F1 score of Llama 3.2 11B-Instruct by 6.97x.
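For readers unfamiliar with the headline metric, ROUGE-L scores a candidate text against a reference using the longest common subsequence (LCS) of their tokens. The sketch below is a generic, from-scratch illustration of the metric the abstract reports, not the paper's evaluation code; it assumes naive whitespace tokenization, whereas published implementations typically add stemming and other normalization.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists,
    computed with standard dynamic programming."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)   # fraction of candidate tokens in the LCS
    recall = lcs / len(ref)       # fraction of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)
```

A perfect match yields 1.0, and a "2.14x improvement in ROUGE-L" means this F1 score more than doubled after fine-tuning.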
Problem

Research questions and friction points this paper is trying to address.

technical diagrams
hand-drawn images
visual language models
image understanding
real-world technical image
Innovation

Methods, ideas, or system contributions that make the work stand out.

synthetic data generation
visual language models
technical diagram understanding
self-supervised learning
hand-drawn image interpretation