"Intelegi Româneşte?" A Recipe for Romanian Vision-Language Models

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the sharp performance degradation of existing vision-language models (VLMs) on low-resource languages such as Romanian, where neither large-scale image-text corpora nor culturally grounded evaluations exist. The study covers the full pipeline for a Romanian-specific VLM, from data construction to architectural choices. Established English training and evaluation corpora are machine-translated into Romanian, covering both the textual annotations and the in-image text, so that visual grounding is preserved while the textual content is adapted. A series of models is then trained and ablated to isolate the contribution of three factors: vision backbones of varying scale and pretraining, language backbones ranging from multilingual to Romanian-adapted LLMs, and OCR-style image-text data. The authors also curate HoraVQA, a culturally native evaluation set grounded in Romanian everyday scenes. The Romanian-adapted VLMs consistently outperform their same-sized counterparts and, across all evaluated benchmarks, surpass models from the next larger size category.

📝 Abstract

Vision-Language Models (VLMs) largely follow the text-only LLM trajectory, excelling on English benchmarks but sharply degrading on low-resource languages, where neither large-scale image-text corpora nor culturally grounded evaluations exist. We present a systematic study of building a language-specific VLM for Romanian, covering the full pipeline from data construction to architectural choices. We translate established English VLM training and evaluation corpora into Romanian, applying machine translation to textual annotations and to in-image text, preserving visual grounding while adapting the textual content. Using this data, we train and ablate a series of VLMs to isolate the contribution of (i) vision backbones of varying scale and pretraining, (ii) language backbones from multilingual to Romanian-adapted LLMs, and (iii) OCR-style image-text data. We further curate HoraVQA, a culturally native evaluation set grounded in Romanian everyday scenes. Romanian-adapted VLMs consistently outperform their same-sized counterparts and, across all evaluated benchmarks, even surpass models from the next larger size category.
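The three-factor ablation described in the abstract can be pictured as a small configuration grid. The sketch below is only an illustration under stated assumptions: the backbone labels (`vision-small`, `multilingual-llm`, and so on) and the helper names `ablation_grid` and `one_factor_pairs` are hypothetical placeholders, since the abstract does not name specific models or say how runs were organized. It enumerates every combination of the three factors and then pairs up configurations that differ in exactly one factor, which is one common way to isolate a factor's contribution.

```python
from itertools import product

# Hypothetical placeholder labels: the abstract names the three factors
# but not the concrete backbones, so these values are assumptions.
VISION_BACKBONES = ["vision-small", "vision-large"]
LANGUAGE_BACKBONES = ["multilingual-llm", "romanian-adapted-llm"]
OCR_DATA = [False, True]


def ablation_grid():
    """Enumerate every combination of the three ablated factors."""
    return [
        {"vision": v, "language": l, "ocr_data": o}
        for v, l, o in product(VISION_BACKBONES, LANGUAGE_BACKBONES, OCR_DATA)
    ]


def one_factor_pairs(grid, factor):
    """Return pairs of configurations that differ only in `factor`."""
    pairs = []
    for i, a in enumerate(grid):
        for b in grid[i + 1:]:
            differing = [key for key in a if a[key] != b[key]]
            if differing == [factor]:
                pairs.append((a, b))
    return pairs


if __name__ == "__main__":
    grid = ablation_grid()
    for a, b in one_factor_pairs(grid, "ocr_data"):
        print(a, "vs", b)
```

Comparing the two members of each pair on the same benchmark attributes any score difference to the single factor that changed, with the other two held fixed.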

Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models

low-resource languages

Romanian

image-text corpora

culturally grounded evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Models

Low-resource Languages

Machine Translation