🤖 AI Summary
This study addresses the lack of systematic evaluation of user interface (UI) design and accessibility compliance in deployed web-based chatbots across the healthcare, education, and customer service domains. Method: We conducted a large-scale empirical assessment of 106 production chatbots using a hybrid multi-tool approach, integrating automated audits (Google Lighthouse, PageSpeed Insights, SiteImprove) with expert manual inspection (Microsoft Accessibility Insights). Contribution/Results: Over 80% of the chatbots exhibited at least one critical accessibility failure, and 45% suffered from semantic HTML deficiencies or incorrect ARIA role usage, revealing for the first time the high prevalence and systemic nature of such issues in conversational UIs. Our analysis confirms the necessity of multi-tool evaluation: accessibility scores correlate strongly across tools (r = 0.861), whereas performance scores correlate only weakly (r = 0.436). These findings establish a methodological foundation and an empirical evidence base for accessibility evaluation of chatbot interfaces.
📝 Abstract
In this work, we present a multi-tool evaluation of 106 deployed web-based chatbots across the healthcare, education, and customer service domains, comprising both standalone applications and embedded widgets, using automated tools (Google Lighthouse, PageSpeed Insights, SiteImprove Accessibility Checker) and manual audits (Microsoft Accessibility Insights). Our analysis reveals that over 80% of the chatbots exhibit at least one critical accessibility issue, and 45% suffer from missing semantic structures or ARIA role misuse. Furthermore, we found that accessibility scores correlate strongly across tools (e.g., Lighthouse vs. PageSpeed Insights, r = 0.861), whereas performance scores do not (r = 0.436), underscoring the value of a multi-tool approach. We offer replicable evaluation insights and actionable recommendations to support the development of accessible, user-friendly conversational interfaces.
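The inter-tool agreement reported above is a standard Pearson correlation over per-chatbot scores. A minimal sketch of that computation follows; the score lists are hypothetical placeholders for illustration, not data from the study:

```python
# Pearson correlation between two tools' per-chatbot scores.
# The scores below are hypothetical placeholders, not the study's data.

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient r for paired samples."""
    n = len(xs)
    assert n == len(ys) and n > 1, "need equal-length samples, n > 1"
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical accessibility scores for five chatbots from two tools:
lighthouse = [92.0, 78.0, 85.0, 60.0, 71.0]
pagespeed = [90.0, 80.0, 83.0, 62.0, 70.0]

print(f"r = {pearson(lighthouse, pagespeed):.3f}")
```

In practice one would collect the per-chatbot category scores exported by each auditing tool and compute r per category, which is how an accessibility correlation (strong) can diverge from a performance correlation (weak) on the same set of chatbots.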