ROSA: Addressing text understanding challenges in photographs via ROtated SAmpling

📅 2025-06-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing vision-language models for visual question answering (VQA) suffer significant performance degradation on images captured by visually impaired users—characterized by skewed text orientation and off-center composition—due to the predominant reliance of mainstream benchmarks on upright, well-framed text images, which inadequately reflect real-world accessibility scenarios. Method: This paper introduces the first cognition-aware framework grounded in empirical analysis of visually impaired users’ image-capturing behaviors. We propose a lightweight rotation-aware sampling and decoding strategy that robustly models text orientation without architectural modification. Specifically, it performs multi-angle image sampling followed by fused decoding to enhance comprehension of arbitrarily oriented text. Contribution/Results: On text-dense images, our method achieves an absolute accuracy gain of 11.7 percentage points over standard greedy decoding. It substantially improves VQA’s practical utility and generalization capability in accessibility-critical settings, establishing a new direction for inclusive multimodal reasoning.

📝 Abstract
Visually impaired people could benefit from Visual Question Answering (VQA) systems to interpret text in their surroundings. However, current models often struggle with recognizing text in the photos taken by this population. Through in-depth interviews with visually impaired individuals, we identified common framing conventions that frequently result in misaligned text. Existing VQA benchmarks primarily feature well-oriented text captured by sighted users, under-representing these challenges. To address this gap, we introduce ROtated SAmpling (ROSA), a decoding strategy that enhances VQA performance in text-rich images with incorrectly oriented text. ROSA outperforms Greedy decoding by 11.7 absolute points in the best-performing model.
Problem

Research questions and friction points this paper is trying to address.

Improving text recognition in photos taken by visually impaired users
Addressing misaligned text challenges in Visual Question Answering systems
Enhancing VQA performance for rotated or poorly oriented text
Innovation

Methods, ideas, or system contributions that make the work stand out.

ROtated SAmpling (ROSA), a decoding strategy for misoriented text
Enhances VQA on images with misaligned text without modifying the model
Improves accuracy over greedy decoding by 11.7 absolute points
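The innovation above can be sketched in code. The paper describes multi-angle sampling followed by fused decoding; the exact fusion rule is not specified here, so this minimal sketch assumes one plausible variant: average the per-view token log-probabilities at each greedy decoding step. All names (`rotations`, `fused_greedy_decode`, `logprob_fn`) are hypothetical, not from the paper.

```python
def rotations(image, angles=(0, 90, 180, 270)):
    """Hypothetical multi-angle sampling: pair the image with each
    candidate rotation angle. A real system would rotate pixels."""
    return [(image, a) for a in angles]

def fused_greedy_decode(logprob_fn, views, vocab, max_len=5, eos="</s>"):
    """Fused decoding sketch: at each step, average each token's
    log-probability across all rotated views, then pick the argmax
    (greedy). logprob_fn(view, prefix, token) -> float is assumed."""
    out = []
    for _ in range(max_len):
        scores = {
            tok: sum(logprob_fn(v, out, tok) for v in views) / len(views)
            for tok in vocab
        }
        tok = max(scores, key=scores.get)
        if tok == eos:
            break
        out.append(tok)
    return out

# Toy usage with a dummy scorer standing in for a VQA model:
vocab = ["exit", "open", "</s>"]
def dummy_lp(view, prefix, tok):
    # every view slightly prefers "exit" first, then end-of-sequence
    if not prefix:
        return 0.0 if tok == "exit" else -1.0
    return 0.0 if tok == "</s>" else -1.0

views = rotations("door_photo.jpg")
print(fused_greedy_decode(dummy_lp, views, vocab))  # → ['exit']
```

Averaging across views lets an upright rotation with confident predictions outvote rotations where the text is unreadable, which is one way a fusion step could yield the reported robustness to arbitrary text orientation.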
Hernán Maina
FAMAF, Universidad Nacional de Córdoba, CONICET, Argentina
Guido Ivetta
Universidad Nacional de Córdoba, Argentina / Fundación Vía Libre
Calibration, Bias in LLMs
Mateo Lione Stuto
FAMAF, Universidad Nacional de Córdoba, Argentina
J. Eisenschlos
FAMAF, Universidad Nacional de Córdoba, Argentina
Jorge Sánchez
Mercado Libre Inc., Argentina
Luciana Benotti
Universidad Nacional de Córdoba, Argentina
Natural Language Processing, Ethics, Conversational Agents, Language Models, Education