🤖 AI Summary
This work investigates the trade-off between rhetorical style and evidential quality in counterargument generation by large language models (LLMs). To this end, we introduce Counterfire, a dataset of 38,000 stylistically diverse counterarguments, together with human-annotated argument triplets—comprising claim, evidence, and rhetorical style—that enable joint study of style control and evidence integration. We evaluate six LLMs, including GPT-3.5 Turbo, GPT-4o, Claude Haiku, and LLaMA-3.1, on counterargument generation and multi-dimensional rhetorical quality assessment. Empirical results show that stylistic enhancement significantly improves persuasiveness (p < 0.01) but consistently reduces evidence density, confirming an intrinsic tension between these dimensions. Among the models evaluated, GPT-3.5 Turbo performs best yet still falls short of human-generated counterarguments. This work establishes a benchmark for controllable, trustworthy argumentative generation and offers insights into the stylistic–evidential tension in LLM-based reasoning.
📝 Abstract
Large language models (LLMs) play a key role in generating evidence-based and stylistic counter-arguments, yet their effectiveness in real-world applications remains underexplored. Previous research often neglects the balance between evidentiality and style, both of which are crucial for persuasive arguments. To address this, we evaluate stylized, evidence-based counter-argument generation in Counterfire, a new dataset of 38,000 counter-arguments produced by revising counter-arguments from Reddit's ChangeMyView community to follow different discursive styles. We assess generic and stylized counter-arguments from base and fine-tuned models such as GPT-3.5, PaLM-2, and Koala-13B, as well as newer models (GPT-4o, Claude Haiku, LLaMA-3.1), focusing on rhetorical quality and persuasiveness. Our findings reveal that humans prefer stylized counter-arguments over the original outputs, with GPT-3.5 Turbo performing well, though still not reaching human standards of rhetorical quality or persuasiveness. Additionally, we contribute a novel argument triplets dataset for studying style control, with human preference labels that provide insights into the trade-offs between evidence integration and argument quality.