Is Our Chatbot Telling Lies? Assessing Correctness of an LLM-based Dutch Support Chatbot

📅 2024-10-29
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
In low-resource Dutch-language settings, evaluating the correctness of LLM-based customer service responses faces three key challenges: the absence of a formal correctness definition, scarcity of annotated data, and the need for real-time assessment. Method: Addressing these challenges within AFAS's enterprise customer support system, we first establish a Dutch-specific response correctness standard grounded in the actual decision logic of the customer service team. We then propose a few-shot, online hallucination detection framework that integrates NLG evaluation with automated scoring, tailored to both binary-judgment and action-oriented user queries. Results: Experiments demonstrate a 55% hallucination detection rate, the first empirical validation of automated hallucination detection for LLM customer agents in a real-world industrial setting. This work provides a reusable methodology and practical paradigm for trustworthy LLM evaluation in low-resource languages.

๐Ÿ“ Abstract
Companies support their customers using live chats and chatbots to gain their loyalty. AFAS is a Dutch company aiming to leverage the opportunity large language models (LLMs) offer to answer customer queries with minimal to no input from its customer support team. Adding to its complexity, it is unclear what makes a response correct, and that too in Dutch. Further, with minimal data available for training, the challenge is to identify whether an answer generated by a large language model is correct and do it on the fly. This study is the first to define the correctness of a response based on how the support team at AFAS makes decisions. It leverages literature on natural language generation and automated answer grading systems to automate the decision-making of the customer support team. We investigated questions requiring a binary response (e.g., Would it be possible to adjust tax rates manually?) or instructions (e.g., How would I adjust tax rate manually?) to test how close our automated approach reaches support rating. Our approach can identify wrong messages in 55% of the cases. This work shows the viability of automatically assessing when our chatbot tell lies.
Problem

Research questions and friction points this paper is trying to address.

Assessing correctness of LLM-generated Dutch chatbot responses
Defining response correctness based on customer support decisions
Automating identification of incorrect chatbot answers with minimal data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Defining correctness based on support team decisions
Automating decision-making using NLG and grading systems
Identifying wrong chatbot messages in 55% of cases
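The automated answer grading the paper draws on can be illustrated with a minimal sketch: score a chatbot answer against a reference answer from the support team and flag it as wrong when the score falls below a threshold. The Jaccard token overlap and the 0.5 threshold below are hypothetical stand-ins for the NLG evaluation metrics and calibration the paper describes, chosen only to keep the example self-contained.

```python
# Illustrative sketch, NOT the paper's implementation: grade a generated
# answer against a support-team reference answer, in the spirit of
# automated answer grading. Jaccard token overlap is a hypothetical
# stand-in for a proper NLG evaluation metric.

def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of word tokens."""
    return set(text.lower().split())

def overlap_score(answer: str, reference: str) -> float:
    """Jaccard similarity between the answer and reference token sets."""
    a, r = tokenize(answer), tokenize(reference)
    if not a and not r:
        return 1.0
    return len(a & r) / len(a | r)

def is_correct(answer: str, reference: str, threshold: float = 0.5) -> bool:
    """Accept the answer if it is similar enough to the reference.
    The 0.5 threshold is an assumption for illustration only."""
    return overlap_score(answer, reference) >= threshold
```

In practice a semantic similarity model or learned grader would replace the token overlap, since a paraphrased but correct Dutch answer shares few surface tokens with the reference.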
🔎 Similar Papers
No similar papers found.
Herman Lassche
Product Development, AFAS Software, Leusden, The Netherlands
Michiel Overeem
Product Development, AFAS Software, Leusden, The Netherlands
Ayushi Rastogi
Assistant Professor at the University of Groningen
code review, human aspects, open source, empirical studies, mining software repositories