Steering LLM Viewpoints through Fabricated Evidence Injection

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

career value

187K/year

🤖 AI Summary

This work demonstrates that large language models are susceptible to deceptive evidence adorned with markers of credibility, often uncritically adopting erroneous claims. To expose this vulnerability, the authors introduce Ghostwriter, a two-stage adversarial framework that first generates superficially plausible false arguments and then injects them into the model’s context to steer its responses toward endorsing these fabricated viewpoints. This study is the first to systematically characterize the cognitive fragility of large language models when confronted with misinformation bearing credible surface features, establishing a reproducible paradigm for perspective manipulation attacks. Empirical evaluations reveal that this vulnerability is prevalent across mainstream models. Furthermore, the proposed tailored defense mechanism enhances the attack detection rate of gpt-oss-safeguard to 81%.

📝 Abstract

As chatbots increasingly influence daily decision-making, their potential to produce misleading responses poses substantial risks to users. This paper investigates a critical cognitive vulnerability in LLMs: their tendency to uncritically trust external context when presented with fabricated evidence bearing markers of credibility. We introduce Ghostwriter, a two-phase attack framework that first repackages misleading statements with fabricated rationales, then instruct target LLMs to incorporate these viewpoints when responding to relevant queries. Experiments on BBQ, ToxiGen, and our specialized dataset reveal that commercial LLMs without external safety classifiers remain highly vulnerable, while even frontier classifier-guarded models (e.g., GPT-5.4) reduce but do not eliminate the attack. Building on this, we explore multiple defense strategies, among which a tailored safety policy enables gpt-oss-safeguard to achieve 81% detection rate.

Problem

Research questions and friction points this paper is trying to address.

LLM vulnerability

fabricated evidence

misleading responses

cognitive trust

viewpoint manipulation

Innovation

Methods, ideas, or system contributions that make the work stand out.

fabricated evidence injection

Ghostwriter framework

LLM cognitive vulnerability