🤖 AI Summary
This work investigates the trade-off between rhetorical style and evidential quality in counterargument generation by large language models (LLMs). To this end, we introduce Counterfire, a dataset of 38,000 stylistically diverse counterarguments, together with human-annotated argument triplets—comprising claim, evidence, and rhetorical style—that enable joint study of style control and evidence integration. We evaluate six LLMs, including GPT-3.5 Turbo, GPT-4o, Claude Haiku, and LLaMA-3.1, on counterargument generation and multi-dimensional rhetorical quality assessment. Empirical results show that stylistic enhancement significantly improves persuasiveness (p < 0.01) but consistently reduces evidence density, confirming an intrinsic tension between these dimensions. Among the models evaluated, GPT-3.5 Turbo performs best yet still falls short of human-generated counterarguments. This work establishes a benchmark for controllable, trustworthy argumentative generation and offers insights into the stylistic–evidential tension in LLM-based reasoning.
📝 Abstract
Large language models (LLMs) play a key role in generating evidence-based and stylistic counter-arguments, yet their effectiveness in real-world applications remains underexplored. Previous research often neglects the balance between evidentiality and style, both of which are crucial for persuasive arguments. To address this, we evaluate stylized, evidence-based counter-argument generation in Counterfire, a new dataset of 38,000 counter-arguments produced by revising counter-arguments from Reddit's ChangeMyView community to follow different discursive styles. We assess generic and stylized counter-arguments from base and fine-tuned models such as GPT-3.5, PaLM-2, and Koala-13B, as well as newer models (GPT-4o, Claude Haiku, LLaMA-3.1), focusing on rhetorical quality and persuasiveness. Our findings reveal that humans prefer stylized counter-arguments over the original outputs, with GPT-3.5 Turbo performing well, though still not reaching human standards of rhetorical quality or persuasiveness. Additionally, we contribute a novel argument triplets dataset for studying style control, with human preference labels that provide insights into the trade-offs between evidence integration and argument quality.