🤖 AI Summary
Scientific abstract generation requires simultaneous control over multiple dimensions—such as length, empirical focus, and implicit stylistic attributes—to suit diverse audiences; however, existing methods predominantly address only single-attribute control and lack systematic evaluation of compositional controllability across explicit and implicit attributes.
Method: We introduce CCSBench, the first benchmark for compositional controllability in scientific summarization, which formally defines and quantifies large language models' (LLMs) capabilities under multi-attribute constraints (e.g., length plus empirical emphasis). Using prompt-based control, controllable-generation evaluation frameworks, and a multidimensional human-AI hybrid assessment protocol, we empirically analyze GPT-4, LLaMA2, and other state-of-the-art LLMs.
Contribution/Results: Our evaluation reveals significant performance degradation in controlling implicit attributes, exposing fundamental limitations in abstract reasoning and multi-objective trade-off handling. CCSBench establishes a reproducible benchmark and delivers critical diagnostic insights to advance controllable scientific summarization research.
📝 Abstract
To broaden the dissemination of scientific knowledge to diverse audiences, scientific document summarization must simultaneously control multiple attributes such as length and empirical focus. However, existing research typically focuses on controlling single attributes, leaving the compositional control of multiple attributes underexplored. To address this gap, we introduce CCSBench, a benchmark for compositional controllable summarization in the scientific domain. Our benchmark enables fine-grained control over both explicit attributes (e.g., length), which are objective and straightforward, and implicit attributes (e.g., empirical focus), which are more subjective and conceptual. We conduct extensive experiments on GPT-4, LLaMA2, and other popular LLMs under various settings. Our findings reveal significant limitations in large language models' ability to balance trade-offs between control attributes, especially implicit ones that require deeper understanding and abstract reasoning.
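The distinction between explicit and implicit attributes can be made concrete with a minimal sketch. The snippet below is illustrative only, not CCSBench's actual API: the function names and attribute vocabulary are assumptions. It shows how a multi-attribute constraint might be composed into a single summarization prompt, and why an explicit attribute like length is "objective and straightforward" to verify, whereas an implicit attribute like empirical focus has no comparably simple check.

```python
# Illustrative sketch (not CCSBench's actual API): composing explicit and
# implicit attribute constraints into one prompt, then verifying the
# explicit attribute (length) automatically.

def build_prompt(paper_text: str, length_words: int, empirical_focus: bool) -> str:
    """Compose multi-attribute constraints into a single instruction."""
    focus = ("emphasize empirical results and experiments"
             if empirical_focus
             else "emphasize conceptual contributions and methodology")
    return (
        f"Summarize the following paper in about {length_words} words. "
        f"In the summary, {focus}.\n\n{paper_text}"
    )

def length_satisfied(summary: str, target_words: int, tolerance: float = 0.2) -> bool:
    """Explicit attributes can be checked objectively; a word count within
    a relative tolerance of the target counts as satisfying the constraint."""
    n = len(summary.split())
    return abs(n - target_words) <= tolerance * target_words

prompt = build_prompt("...paper text...", length_words=100, empirical_focus=True)
print(length_satisfied("word " * 95, target_words=100))  # True: 95 words is within 20% of 100
```

No such one-line check exists for empirical focus, which is why evaluating implicit attributes requires the deeper, more subjective assessment the benchmark is designed to probe.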