AI Summary
This work demonstrates that large language models are susceptible to deceptive evidence adorned with markers of credibility, often uncritically adopting erroneous claims. To expose this vulnerability, the authors introduce Ghostwriter, a two-stage adversarial framework that first generates superficially plausible false arguments and then injects them into the model's context to steer its responses toward endorsing these fabricated viewpoints. This study is the first to systematically characterize the cognitive fragility of large language models when confronted with misinformation bearing credible surface features, establishing a reproducible paradigm for perspective manipulation attacks. Empirical evaluations reveal that this vulnerability is prevalent across mainstream models. Furthermore, the proposed tailored defense mechanism enhances the attack detection rate of gpt-oss-safeguard to 81%.
Abstract
As chatbots increasingly influence daily decision-making, their potential to produce misleading responses poses substantial risks to users. This paper investigates a critical cognitive vulnerability in LLMs: their tendency to uncritically trust external context when presented with fabricated evidence bearing markers of credibility. We introduce Ghostwriter, a two-phase attack framework that first repackages misleading statements with fabricated rationales, then instructs target LLMs to incorporate these viewpoints when responding to relevant queries. Experiments on BBQ, ToxiGen, and our specialized dataset reveal that commercial LLMs without external safety classifiers remain highly vulnerable, while even frontier classifier-guarded models (e.g., GPT-5.4) reduce but do not eliminate the attack's effectiveness. Building on this, we explore multiple defense strategies, among which a tailored safety policy enables gpt-oss-safeguard to achieve an 81% detection rate.