ROSA: Addressing text understanding challenges in photographs via ROtated SAmpling

📅 2025-06-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing vision-language models for visual question answering (VQA) suffer significant performance degradation on images captured by visually impaired users—characterized by skewed text orientation and off-center composition—due to the predominant reliance of mainstream benchmarks on upright, well-framed text images, which inadequately reflect real-world accessibility scenarios. Method: This paper introduces the first cognition-aware framework grounded in empirical analysis of visually impaired users’ image-capturing behaviors. We propose a lightweight rotation-aware sampling and decoding strategy that robustly models text orientation without architectural modification. Specifically, it performs multi-angle image sampling followed by fused decoding to enhance comprehension of arbitrarily oriented text. Contribution/Results: On text-dense images, our method achieves an absolute accuracy gain of 11.7 percentage points over standard greedy decoding. It substantially improves VQA’s practical utility and generalization capability in accessibility-critical settings, establishing a new direction for inclusive multimodal reasoning.

📝 Abstract
Visually impaired people could benefit from Visual Question Answering (VQA) systems to interpret text in their surroundings. However, current models often struggle with recognizing text in the photos taken by this population. Through in-depth interviews with visually impaired individuals, we identified common framing conventions that frequently result in misaligned text. Existing VQA benchmarks primarily feature well-oriented text captured by sighted users, under-representing these challenges. To address this gap, we introduce ROtated SAmpling (ROSA), a decoding strategy that enhances VQA performance in text-rich images with incorrectly oriented text. ROSA outperforms Greedy decoding by 11.7 absolute points in the best-performing model.
Problem

Research questions and friction points this paper is trying to address.

Improving text recognition in photos taken by visually impaired users
Addressing misaligned text challenges in Visual Question Answering systems
Enhancing VQA performance for rotated or poorly oriented text
Innovation

Methods, ideas, or system contributions that make the work stand out.

ROtated SAmpling (ROSA), a decoding strategy for misoriented text
Enhances VQA on images with misaligned text without modifying the model
Improves accuracy over greedy decoding by 11.7 absolute points
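The innovation above can be sketched in code. The paper describes multi-angle sampling followed by fused decoding; the exact fusion rule is not specified here, so this minimal sketch assumes one plausible variant: average the per-view token log-probabilities at each greedy decoding step. All names (`rotations`, `fused_greedy_decode`, `logprob_fn`) are hypothetical, not from the paper.

```python
def rotations(image, angles=(0, 90, 180, 270)):
    """Hypothetical multi-angle sampling: pair the image with each
    candidate rotation angle. A real system would rotate pixels."""
    return [(image, a) for a in angles]

def fused_greedy_decode(logprob_fn, views, vocab, max_len=5, eos="</s>"):
    """Fused decoding sketch: at each step, average each token's
    log-probability across all rotated views, then pick the argmax
    (greedy). logprob_fn(view, prefix, token) -> float is assumed."""
    out = []
    for _ in range(max_len):
        scores = {
            tok: sum(logprob_fn(v, out, tok) for v in views) / len(views)
            for tok in vocab
        }
        tok = max(scores, key=scores.get)
        if tok == eos:
            break
        out.append(tok)
    return out

# Toy usage with a dummy scorer standing in for a VQA model:
vocab = ["exit", "open", "</s>"]
def dummy_lp(view, prefix, tok):
    # every view slightly prefers "exit" first, then end-of-sequence
    if not prefix:
        return 0.0 if tok == "exit" else -1.0
    return 0.0 if tok == "</s>" else -1.0

views = rotations("door_photo.jpg")
print(fused_greedy_decode(dummy_lp, views, vocab))  # → ['exit']
```

Averaging across views lets an upright rotation with confident predictions outvote rotations where the text is unreadable, which is one way a fusion step could yield the reported robustness to arbitrary text orientation.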
Hernán Maina
FAMAF, Universidad Nacional de Córdoba, CONICET, Argentina
Guido Ivetta
Universidad Nacional de Córdoba, Argentina / Fundación Vía Libre
Calibration, Bias in LLMs
Mateo Lione Stuto
FAMAF, Universidad Nacional de Córdoba, Argentina
J. Eisenschlos
FAMAF, Universidad Nacional de Córdoba, Argentina
Jorge Sánchez
Mercado Libre Inc., Argentina
Luciana Benotti
Universidad Nacional de Córdoba, Argentina
Natural Language Processing, Ethics, Conversational Agents, Language Models, Education