A Visually Impaired Assistance Benchmark for VLM-as-a-Judge Evaluation

📅 2026-05-29

🤖 AI Summary

This work addresses the lack of reliable, systematic evaluation of vision-language models (VLMs) in Visually Impaired Assistance (VIA) tasks, where assessment currently relies on costly human judgments. We introduce VIABLE, the first VLM-as-a-Judge benchmark tailored to VIA, comprising over 300,000 judgment samples across three real-world scenarios, an evaluation framework built on effectiveness, impartiality, and stability, and a taxonomy of 12 failure modes. Building on this foundation, we propose VIA-Judge-Agent, a model-agnostic inference-time method that combines visual evidence extraction with a taxonomy-guided workflow to improve judgment reliability. Experiments show that even the strongest existing judge, GPT-5.4, reaches only 52.6% accuracy in single-failure diagnosis, whereas VIA-Judge-Agent improves diagnostic accuracy and yields assistive responses that blind and low-vision (BLV) users prefer.

📝 Abstract

AI-based Visually Impaired Assistance (VIA) remains challenging, largely due to the high cost of human evaluation. The VLM-as-a-Judge paradigm may offer a promising alternative, although it has mostly been studied in general domains. We therefore ask whether such judges can be trusted for VIA tasks. To investigate this question, we introduce VIABLE (Visually Impaired Assistance Benchmark for VLM-as-a-Judge Evaluation), the first benchmark for VLM-as-a-Judge evaluation in VIA. VIABLE contains over 300K judgment samples across three scenarios and introduces an Effectiveness-Impartiality-Stability framework with a 12-mode failure taxonomy. Based on VIABLE, our systematic study of seven judges at different model scales shows that existing models are largely unreliable on all three evaluation axes. The strongest judge, GPT-5.4, achieves only 52.6% single-failure diagnostic accuracy and exhibits the highest self-preference rate, at 94.2%, while open-source judges are strongly biased and adversarially fragile. To address these issues, we propose VIA-Judge-Agent, a model-agnostic inference-time harness that augments judges with visual evidence extraction and a taxonomy-guided workflow. It improves diagnostic accuracy and yields downstream VIA responses that blind and low-vision (BLV) users prefer. Data and code are available at: https://github.com/YiyiyiZhao/VIABLE
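
The abstract does not define the two headline metrics, so the following is only an illustrative sketch of how they might be computed, not the VIABLE implementation. It assumes that single-failure diagnostic accuracy is the fraction of samples for which the judge names the one failure mode actually present, and that self-preference rate is the fraction of pairwise comparisons involving the judge's own response in which the judge picks that response. The record types, field names, and failure-mode labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiagnosisSample:
    """One single-failure diagnosis item (hypothetical schema)."""
    true_failure: str       # the one failure mode present in the response
    predicted_failure: str  # the failure mode named by the judge


@dataclass(frozen=True)
class PairwiseSample:
    """One pairwise comparison scored by a judge (hypothetical schema)."""
    judge_model: str
    candidate_a: str  # model that produced response A
    candidate_b: str  # model that produced response B
    winner: str       # "a" or "b", as chosen by the judge


def diagnostic_accuracy(samples: list[DiagnosisSample]) -> float:
    """Fraction of samples whose predicted failure mode matches the true one."""
    if not samples:
        raise ValueError("no diagnosis samples")
    correct = sum(s.true_failure == s.predicted_failure for s in samples)
    return correct / len(samples)


def self_preference_rate(samples: list[PairwiseSample]) -> float:
    """Fraction of own-vs-other comparisons in which the judge picks itself.

    Assumption: only comparisons where exactly one candidate is the judge
    model count toward the rate.
    """
    relevant = [
        s for s in samples
        if (s.candidate_a == s.judge_model) != (s.candidate_b == s.judge_model)
    ]
    if not relevant:
        raise ValueError("no comparisons involve the judge's own response")
    self_wins = sum(
        (s.candidate_a if s.winner == "a" else s.candidate_b) == s.judge_model
        for s in relevant
    )
    return self_wins / len(relevant)
```

Under these assumed definitions the two quantities are independent, so a judge can combine low diagnostic accuracy with a high self-preference rate, which is the pattern the abstract reports for the strongest judge.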

Problem

Research questions and friction points this paper is trying to address.

Visually Impaired Assistance

VLM-as-a-Judge

Evaluation Benchmark

Reliability

AI-based Assistance

Innovation

Methods, ideas, or system contributions that make the work stand out.

VLM-as-a-Judge

Visually Impaired Assistance

Evaluation Benchmark