Evaluating LLMs' Effectiveness on Real-World Consumer Device Repair Questions

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study evaluates the reliability of large language models (LLMs) in real-world consumer electronics repair, focusing on their accuracy when information is incomplete, diagnostics are complex, and safety risks are high. The authors introduce RepairBench, the first multilingual benchmark tailored to authentic repair tasks. It comprises 991 real-world Reddit queries spanning smartphone repair, computer repair, and data recovery, each annotated with a technician-authored reference answer and a Bangla translation. Human evaluations assess model responses on four dimensions: correctness, completeness, practicality, and safety. The results show that current LLMs frequently give erroneous advice in board-level diagnostics and safety-critical operations, which makes them unreliable for high-risk repairs even though they can provide useful assistance. GPT-5.4 performs best overall, and every model scores consistently lower in Bangla than in English, revealing a clear cross-lingual performance gap.

📝 Abstract

Consumer device repair is an important but underexplored testbed for large language models (LLMs). Repair tasks require reasoning over incomplete problem descriptions, hardware-specific diagnostics, actionable troubleshooting, and safety-critical decisions, where incorrect advice can cause device damage, battery hazards, or permanent data loss. We introduce a benchmark of 991 real-world repair questions from Reddit spanning phone repair, computer repair, and data recovery, each paired with technician-written reference solutions, and provide Bangla translations to evaluate cross-lingual performance. We evaluate six state-of-the-art LLMs in English and Bangla using four repair-specific criteria: correctness, completeness, practicality, and safety. Our results show that while LLMs can provide useful repair assistance, they remain unreliable for high-risk real-world repair tasks without rigorous evaluation and explicit safety safeguards. Phone repair is the most difficult and safety-sensitive domain, and all models make substantial errors in board-level diagnosis, repair prioritization, and safe recovery procedures. Across domains and models, Bangla responses consistently perform worse than English responses. Among the evaluated models, GPT-5.4 performs best overall.
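The evaluation design described above (several models, two languages, four criteria) can be illustrated with a small aggregation sketch. This is not the authors' code: the record layout, the model name `model-a`, the language codes `en` and `bn`, and the 1-5 scoring scale are all assumptions made for illustration, and the scores are toy values rather than results from the paper.

```python
from collections import defaultdict
from statistics import mean

# The four criteria named in the abstract.
CRITERIA = ("correctness", "completeness", "practicality", "safety")


def mean_scores(records):
    """Average each criterion for every (model, language) pair."""
    buckets = defaultdict(lambda: defaultdict(list))
    for rec in records:
        key = (rec["model"], rec["language"])
        for criterion in CRITERIA:
            buckets[key][criterion].append(rec["scores"][criterion])
    return {
        key: {c: mean(values) for c, values in per_criterion.items()}
        for key, per_criterion in buckets.items()
    }


def language_gap(averages, model, base="en", other="bn"):
    """Per-criterion difference (base minus other) for one model."""
    return {
        c: averages[(model, base)][c] - averages[(model, other)][c]
        for c in CRITERIA
    }


# Hypothetical toy ratings on an assumed 1-5 scale (not data from the paper).
records = [
    {"model": "model-a", "language": "en",
     "scores": {"correctness": 4, "completeness": 4, "practicality": 3, "safety": 5}},
    {"model": "model-a", "language": "en",
     "scores": {"correctness": 2, "completeness": 4, "practicality": 3, "safety": 3}},
    {"model": "model-a", "language": "bn",
     "scores": {"correctness": 2, "completeness": 3, "practicality": 2, "safety": 3}},
    {"model": "model-a", "language": "bn",
     "scores": {"correctness": 2, "completeness": 3, "practicality": 4, "safety": 3}},
]

averages = mean_scores(records)
gap = language_gap(averages, "model-a")
print(gap)
```

A positive value in `gap` means the English responses scored higher on that criterion than the Bangla ones. Reporting the gap per criterion rather than as a single number keeps safety differences visible instead of averaging them away.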

Problem

Research questions and friction points this paper is trying to address.

large language models

consumer device repair

safety-critical decisions

real-world benchmark

cross-lingual evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

repair benchmark

cross-lingual evaluation

safety-critical reasoning