🤖 AI Summary
Existing argument key point extraction methods have been evaluated almost exclusively on the ArgKP21 dataset, which suffers from limited scale, narrow topical coverage, and little long-range coreference or subjective expression, and therefore fails to reflect the complexity of real-world online debates. To address this, the authors introduce ArgCMV, a benchmark dataset for argument key point extraction tailored to the large language model (LLM) era, constructed from authentic discussions on the ChangeMyView (CMV) forum. ArgCMV comprises approximately 12K arguments spanning more than 3K diverse topics, with substantially longer contexts, greater semantic and structural complexity, and a higher density of subjective discourse. Annotations combine state-of-the-art LLM assistance with rigorous human verification to ensure quality. The authors systematically evaluate leading open-source models on ArgCMV, uncovering performance bottlenecks on realistic argument understanding tasks. ArgCMV fills a gap in high-fidelity, dialogue-oriented argument modeling and establishes a new standard, along with a set of concrete challenges, for next-generation argument understanding research.
📝 Abstract
Key point (KP) extraction is an important task in argument summarization that involves distilling high-level short summaries from arguments. Existing approaches for KP extraction have been evaluated mostly on the popular ArgKP21 dataset. In this paper, we highlight major limitations of the ArgKP21 dataset and demonstrate the need for new benchmarks that are more representative of actual human conversations. Using state-of-the-art (SoTA) large language models (LLMs), we curate a new argument key point extraction dataset called ArgCMV, comprising around 12K arguments from actual online human debates spread across over 3K topics. Compared to ArgKP21, our dataset exhibits higher complexity: longer, co-referencing arguments, a higher proportion of subjective discourse units, and a wider range of topics. We show that existing methods do not adapt well to ArgCMV and provide extensive benchmark results by experimenting with existing baselines and the latest open-source models. This work introduces a novel KP extraction dataset for long-context online discussions, setting the stage for the next generation of LLM-driven summarization research.
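For readers new to the task, a core step in KP extraction pipelines is matching each argument to its closest candidate key point. The toy baseline below is an illustrative sketch only, not the paper's method: it uses bag-of-words cosine similarity, whereas real systems typically use learned sentence encoders. The example arguments and key points are invented for demonstration.

```python
# Illustrative sketch of key point (KP) matching; not the paper's method.
import math
import re
from collections import Counter

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two texts."""
    tok = lambda s: re.findall(r"[a-z']+", s.lower())
    va, vb = Counter(tok(a)), Counter(tok(b))
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def match_key_points(arguments, key_points):
    """Assign each argument its best-matching candidate key point."""
    return {
        arg: max(key_points, key=lambda kp: bow_cosine(arg, kp))
        for arg in arguments
    }

# Invented toy data for illustration.
args = [
    "School uniforms suppress a student's individual style and expression.",
    "Uniforms reduce pressure on families to buy expensive clothes.",
]
kps = [
    "Uniforms limit self expression",
    "Uniforms lower clothing costs for families",
]
print(match_key_points(args, kps))
```

A lexical-overlap matcher like this is exactly the kind of shallow baseline that longer, co-referencing arguments (as in ArgCMV) are expected to defeat, since the key point's wording may share few tokens with the argument that supports it.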