🤖 AI Summary
This study addresses the information-reliability challenge in chat assistants that integrate web search, aiming to mitigate the risk of propagating misinformation from low-credibility sources. We construct a multi-topic evaluation dataset of 100 high-risk, misinformation-prone claims and combine human annotation with automated analysis to systematically assess leading models, including Perplexity and GPT-4o, on source credibility and response groundedness. We propose the first AI reliability evaluation framework tailored to high-risk information environments, operationalized along three dimensions: provenance capability, citation quality, and evidence consistency. Results show that Perplexity cites high-credibility sources most consistently, whereas GPT-4o relies significantly more on low-credibility sources for sensitive topics, revealing substantial reliability disparities between models. This work establishes a reproducible benchmark and methodological foundation for developing trustworthy AI dialogue systems.
📝 Abstract
Chat assistants increasingly integrate web search functionality, enabling them to retrieve and cite external sources. While this promises more reliable answers, it also raises the risk of amplifying misinformation from low-credibility sources. In this paper, we introduce a novel methodology for evaluating assistants' web search behavior, focusing on source credibility and the groundedness of responses with respect to cited sources. Using 100 claims across five misinformation-prone topics, we assess GPT-4o, GPT-5, Perplexity, and Qwen Chat. Our findings reveal differences between the assistants: Perplexity achieves the highest source credibility, whereas GPT-4o exhibits elevated citation of non-credible sources on sensitive topics. This work provides the first systematic comparison of commonly used chat assistants' fact-checking behavior, offering a foundation for evaluating AI systems in high-stakes information environments.
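To make the source-credibility dimension concrete, the sketch below scores a response's cited URLs against a domain-level credibility list. This is a minimal illustration, not the paper's actual method: the `DOMAIN_CREDIBILITY` ratings, the neutral fallback for unknown domains, and the plain mean aggregation are all assumptions for the example; real evaluations of this kind typically rely on curated domain-rating lists.

```python
from urllib.parse import urlparse

# Hypothetical credibility ratings in [0, 1]; values are illustrative only.
DOMAIN_CREDIBILITY = {
    "who.int": 0.95,
    "reuters.com": 0.90,
    "example-blog.net": 0.20,
}

def source_credibility(cited_urls, default=0.5):
    """Mean credibility score of a response's cited sources.

    Unknown domains fall back to a neutral default; a response with
    no citations has undefined credibility (returns None).
    """
    if not cited_urls:
        return None
    scores = []
    for url in cited_urls:
        # Normalize "www.example.com" -> "example.com" before lookup.
        domain = urlparse(url).netloc.removeprefix("www.")
        scores.append(DOMAIN_CREDIBILITY.get(domain, default))
    return sum(scores) / len(scores)
```

For example, a response citing `https://www.who.int/...` (0.95) and `https://example-blog.net/...` (0.20) would score 0.575; comparing such per-response scores across assistants and topics is the kind of aggregation the paper's credibility analysis performs at scale.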