EgoBlind: Towards Egocentric Visual Assistance for the Blind People

📅 2025-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the suboptimal performance of multimodal large language models (MLLMs) in visual assistance for blind individuals. To this end, we introduce the first egocentric video question-answering (VideoQA) benchmark specifically designed for blind users—comprising 1,210 real-world first-person videos of daily activities by blind participants and 4,927 need-driven questions. Crucially, blind individuals co-designed the entire pipeline: data collection, question formulation, and answer validation; multiple reference answers were incorporated to mitigate subjectivity. Comprehensive evaluation across 15 state-of-the-art MLLMs reveals that the best-performing model achieves only 56.0% accuracy—substantially below human performance (87.4%)—and exposes critical weaknesses in egocentric scene understanding, spatial reasoning, and task-specific need alignment. The benchmark is fully reproducible and accompanied by diagnostic analysis, providing both a rigorous evaluation framework and actionable directions for improvement. This work establishes a foundational resource for advancing accessible AI research.

📝 Abstract
We present EgoBlind, the first egocentric VideoQA dataset collected from blind individuals to evaluate the assistive capabilities of contemporary multimodal large language models (MLLMs). EgoBlind comprises 1,210 videos that record the daily lives of real blind users from a first-person perspective. It also features 4,927 questions directly posed or generated and verified by blind individuals to reflect their needs for visual assistance under various scenarios. We provide each question with an average of 3 reference answers to alleviate subjective evaluation. Using EgoBlind, we comprehensively evaluate 15 leading MLLMs and find that all models struggle, with the best performers achieving accuracy around 56%, far behind human performance of 87.4%. To guide future advancements, we identify and summarize major limitations of existing MLLMs in egocentric visual assistance for the blind and provide heuristic suggestions for improvement. With these efforts, we hope EgoBlind can serve as a valuable foundation for developing more effective AI assistants to enhance the independence of the blind individuals' lives.
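To make the multi-reference evaluation concrete, below is a minimal sketch of how answers to open-ended VideoQA questions can be scored against several reference answers, since EgoBlind provides roughly three per question. The record fields, the dummy model, and the word-overlap judge are hypothetical placeholders rather than the paper's actual schema or scoring protocol; in practice a stronger judge (an LLM or a human rater) would replace the overlap check.

# Hypothetical sketch of multi-reference VideoQA scoring, not EgoBlind's
# official protocol; field names, the dummy model, and the string-overlap
# judge below are illustrative placeholders only.

def is_correct(prediction, references, judge):
    # Accept the prediction if the judge matches it against ANY reference;
    # multiple references per question soften the subjectivity of
    # open-ended answers.
    return any(judge(prediction, ref) for ref in references)

def evaluate(samples, model, judge):
    # Fraction of questions whose model answer matches at least one reference.
    hits = sum(
        is_correct(model(s["video"], s["question"]), s["answers"], judge)
        for s in samples
    )
    return hits / max(len(samples), 1)

# Tiny self-contained demo with made-up data and a trivial judge.
samples = [
    {"video": "clip_0001.mp4",
     "question": "Is the crosswalk signal in front of me green?",
     "answers": ["Yes, the signal is green.",
                 "The walk light is on.",
                 "Green, you can cross."]},
]
dummy_model = lambda video, question: "The signal is green, it is safe to cross."
overlap_judge = lambda pred, ref: any(w in pred.lower() for w in ref.lower().split())

print(f"accuracy = {evaluate(samples, dummy_model, overlap_judge):.1%}")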
Problem

Research questions and friction points this paper is trying to address.

Evaluating MLLMs for blind visual assistance using egocentric VideoQA.
Assessing daily life scenarios from blind individuals' perspectives.
Identifying limitations and suggesting improvements for MLLMs in visual assistance.
Innovation

Methods, ideas, or system contributions that make the work stand out.

First egocentric VideoQA dataset for blind individuals
Evaluates multimodal large language models' assistive capabilities
Identifies limitations and suggests improvements for MLLMs
👥 Authors
Junbin Xiao
National University of Singapore
Video and Language · Embodied Interaction · Trustworthy Multimodality
Nanxin Huang
Communication University of China
Hao Qiu
Communication University of China
Zhulin Tao
Communication University of China
Xun Yang
University of Science and Technology of China
Richang Hong
Hefei University of Technology
Multimedia · Pattern Recognition
Meng Wang
Hefei University of Technology
Angela Yao
National University of Singapore
computer vision · deep learning · machine learning