🤖 AI Summary
Scientific abstract generation requires simultaneous control over multiple dimensions—such as length, empirical focus, and implicit stylistic attributes—to suit diverse audiences; however, existing methods predominantly address only single-attribute control and lack systematic evaluation of compositional controllability across explicit and implicit attributes.
Method: We introduce CCSBench, the first benchmark for compositional controllability in scientific summarization, which formally defines and quantifies large language models' (LLMs) capabilities under multi-attribute constraints (e.g., length plus empirical emphasis). Using prompt-based control, controllable-generation evaluation frameworks, and a multidimensional human-AI hybrid assessment protocol, we empirically analyze GPT-4, LLaMA2, and other state-of-the-art LLMs.
Contribution/Results: Our evaluation reveals significant performance degradation in controlling implicit attributes, exposing fundamental limitations in abstract reasoning and multi-objective trade-off handling. CCSBench establishes a reproducible benchmark and delivers critical diagnostic insights to advance controllable scientific summarization research.
📝 Abstract
To broaden the dissemination of scientific knowledge to diverse audiences, scientific document summarization must simultaneously control multiple attributes such as length and empirical focus. However, existing research typically focuses on controlling single attributes, leaving the compositional control of multiple attributes underexplored. To address this gap, we introduce CCSBench, a benchmark for compositional controllable summarization in the scientific domain. Our benchmark enables fine-grained control over both explicit attributes (e.g., length), which are objective and straightforward, and implicit attributes (e.g., empirical focus), which are more subjective and conceptual. We conduct extensive experiments on GPT-4, LLaMA2, and other popular LLMs under various settings. Our findings reveal significant limitations in large language models' ability to balance trade-offs between control attributes, especially implicit ones that require deeper understanding and abstract reasoning.
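The distinction between explicit and implicit attributes can be made concrete with a minimal sketch. The snippet below is illustrative only, not CCSBench's actual API: the function names and attribute vocabulary are assumptions. It shows how a multi-attribute constraint might be composed into a single summarization prompt, and why an explicit attribute like length is "objective and straightforward" to verify, whereas an implicit attribute like empirical focus has no comparably simple check.

```python
# Illustrative sketch (not CCSBench's actual API): composing explicit and
# implicit attribute constraints into one prompt, then verifying the
# explicit attribute (length) automatically.

def build_prompt(paper_text: str, length_words: int, empirical_focus: bool) -> str:
    """Compose multi-attribute constraints into a single instruction."""
    focus = ("emphasize empirical results and experiments"
             if empirical_focus
             else "emphasize conceptual contributions and methodology")
    return (
        f"Summarize the following paper in about {length_words} words. "
        f"In the summary, {focus}.\n\n{paper_text}"
    )

def length_satisfied(summary: str, target_words: int, tolerance: float = 0.2) -> bool:
    """Explicit attributes can be checked objectively; a word count within
    a relative tolerance of the target counts as satisfying the constraint."""
    n = len(summary.split())
    return abs(n - target_words) <= tolerance * target_words

prompt = build_prompt("...paper text...", length_words=100, empirical_focus=True)
print(length_satisfied("word " * 95, target_words=100))  # True: 95 words is within 20% of 100
```

No such one-line check exists for empirical focus, which is why evaluating implicit attributes requires the deeper, more subjective assessment the benchmark is designed to probe.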