AI Summary
This study investigates how users leverage web search to verify the factual accuracy of large language model (LLM) outputs and mitigate hallucination risks. A randomized controlled experiment (N = 560) compared static and dynamic search interfaces, relative to a no-search control, in terms of users' hallucination detection accuracy and confidence, incorporating the Need for Cognition (NFC) scale and a three-tier content scheme (genuine / minor hallucination / major hallucination). Results reveal that dynamic search significantly improves users' accuracy in identifying genuine statements and enhances their overall confidence; both static and dynamic search lower the perceived accuracy of hallucinated content; and high-NFC users are more sensitive to major hallucinations. This work provides empirical evidence on cognitive mechanisms underlying human-AI collaborative verification, offering insights for designing hallucination-resilient interactive interfaces grounded in both empirical data and cognitive theory.
Abstract
While we increasingly rely on large language models (LLMs) for various tasks, these models are known to produce inaccurate content, or 'hallucinations', with potentially disastrous consequences. The recent integration of web search results into LLMs prompts the question of whether people utilize them to verify the generated content, thereby avoiding falling victim to hallucinations. This study (N = 560) investigated how the provision of search results, either static (fixed search results) or dynamic (participant-driven searches), affects participants' perceived accuracy and confidence in evaluating LLM-generated content (i.e., genuine, minor hallucination, major hallucination), compared to a control condition (no search results). Findings indicate that participants in both the static and dynamic conditions (vs. control) rated hallucinated content as less accurate. However, those in the dynamic condition rated genuine content as more accurate and demonstrated greater overall confidence in their assessments than those in the static or control conditions. In addition, participants higher in need for cognition (NFC) rated major hallucinations as less accurate than those lower in NFC, with no corresponding difference for genuine content or minor hallucinations. These results underscore the potential benefits of integrating web search results into LLMs for the detection of hallucinations, as well as the need for a more nuanced approach when developing human-centered systems that takes user characteristics into account.