🤖 AI Summary
Current speech translation evaluation often relies on decontextualized, aggregate quality metrics that fail to capture user-perceived usability in real-world interactions. This work proposes Ouvia, an evaluation framework that uses a custom web application and a multi-phase study design to collect more than 1,750 one-to-one interactions in which an English speaker's request is automatically translated for a Portuguese speaker, in healthcare and everyday scenarios. The study covers four speech translation systems and speakers from three English dialects and two genders, and compares quality metrics, including question-answering (QA)-based evaluation, as predictors of usability. Only about half of the interactions are rated as usable, with significant gaps across demographic groups. QA-based evaluation is a substantially stronger predictor of real-world usability than standard approaches.
📝 Abstract
Speech translation (ST) is increasingly adopted in user applications, yet its evaluation largely focuses on decontextualized testbeds and holistic quality, rather than end users' communication needs. We introduce Ouvia, an evaluation framework for measuring user-perceived usability of speech translation outputs in real-world settings. Ouvia focuses on one-to-one communication: an English speaker needs to convey a request to a Portuguese speaker, and the message is automatically translated. Through a custom web app and multi-phase study design, we collect more than 1,750 such interactions in healthcare and everyday situations, mediated by four ST systems, involving speakers from three English dialects and two genders. We find that modern ST serves people only to a limited extent: only around half of interactions are rated as usable, with significant gaps in reported usability across demographic groups. Moreover, among quality metrics, we find that QA-based evaluation is a substantially stronger predictor of real-world usability than standard approaches. Together, these findings stress the importance of situated, user-centered evaluation frameworks that go beyond holistic quality scores and attend to who the technology serves, and how well.