🤖 AI Summary
To address the visual-centric bias, high annotation cost, and mismatch with BLV standards that arise when sighted annotators write diagram descriptions for blind and low-vision (BLV) users, this paper has sighted annotators evaluate descriptions rather than author them. The descriptions are generated by vision-language models (VLMs) guided with latent supervision via multi-pass inference, and sighted individuals then assess them. These assessments prove effective and useful to professional educators who are themselves BLV and teach visually impaired learners. The authors release Sightation, a collection of diagram description datasets spanning 5k diagrams and 137k samples for five training purposes: completion, preference, retrieval, question answering, and reasoning. They also demonstrate the datasets' fine-tuning potential on various downstream tasks.
📝 Abstract
Often, the needs and visual abilities differ between the annotator group and the end user group. Generating detailed diagram descriptions for blind and low-vision (BLV) users is one such challenging domain. Sighted annotators could describe visuals with ease, but existing studies have shown that direct generations by them are costly, bias-prone, and somewhat lacking by BLV standards. In this study, we ask sighted individuals to assess -- rather than produce -- diagram descriptions generated by vision-language models (VLMs) that have been guided with latent supervision via multi-pass inference. The sighted assessments prove effective and useful to professional educators who are themselves BLV and teach visually impaired learners. We release Sightation, a collection of diagram description datasets spanning 5k diagrams and 137k samples for completion, preference, retrieval, question answering, and reasoning training purposes, and demonstrate their fine-tuning potential in various downstream tasks.