Not too long to read: Evaluating LLM-generated extreme scientific summaries

📅 2025-12-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
The absence of high-quality, researcher-authored scientific TLDR (Too Long; Didn’t Read) benchmarks impedes rigorous evaluation and optimization of large language models (LLMs) for this task. To address this, we introduce BiomedTLDR—the first author-annotated TLDR dataset in biomedicine—and propose a multidimensional evaluation framework integrating ROUGE, BERTScore, and expert human assessment. This work establishes the first large-scale, high-fidelity TLDR benchmark grounded in authentic author feedback. Our analysis reveals that prevailing open-source LLMs predominantly rely on extractive generation and exhibit substantially weaker abstraction capabilities than human experts. Furthermore, we demonstrate that supervised fine-tuning on abstractive TLDRs improves model performance, yet persistent bottlenecks remain—including insufficient semantic condensation and rigid output structure. Collectively, this study provides a novel benchmark, empirical insights, and methodological foundations for advancing scientific TLDR generation.
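The evaluation framework pairs automatic metrics such as ROUGE with human judgment. As an illustration of the automatic side, here is a minimal pure-Python sketch of ROUGE-1 F1 (unigram overlap) between a generated TLDR and an author-written reference; this is an illustrative re-implementation, not the paper's exact scoring pipeline, and real evaluations typically use a library such as `rouge-score`.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: clipped unigram overlap between a candidate summary
    and a reference summary (illustrative sketch)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # matches, clipped per token
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f1("llms are often extractive summarizers",
                      "llms tend to be extractive summarizers"), 3))  # → 0.545
```

High lexical-overlap scores like this one reward extractive copying, which is why the paper's framework also adds embedding-based scoring (BERTScore) and expert human assessment.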

📝 Abstract
High-quality scientific extreme summaries (TLDRs) facilitate effective science communication. How do large language models (LLMs) perform in generating them? How do LLM-generated summaries differ from those written by human experts? The lack of a comprehensive, high-quality scientific TLDR dataset hinders both the development and evaluation of LLMs' summarization ability. To address this, we propose a novel dataset, BiomedTLDR, containing a large sample of researcher-authored summaries of scientific papers, which leverages the common practice of including authors' comments alongside bibliography items. We then test popular open-weight LLMs on generating TLDRs from abstracts. Our analysis reveals that, although some of them successfully produce human-like summaries, LLMs generally exhibit a greater affinity for the original text's lexical choices and rhetorical structures, and hence tend to be more extractive than abstractive compared with humans. Our code and datasets are available at https://github.com/netknowledge/LLM_summarization (Lyu and Ke, 2025).
Problem

Research questions and friction points this paper is trying to address.

Evaluates LLM-generated scientific extreme summaries (TLDRs)
Compares LLM and human expert summaries for differences
Addresses lack of dataset for LLM summarization development and evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dataset creation using author comments
Testing LLMs on abstract-to-summary generation
Analyzing extractive versus abstractive summarization tendencies
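One common way to quantify the extractive-versus-abstractive tendency is the fraction of summary n-grams that do not appear in the source text: the lower the fraction, the more the summary copies its source. The helper below is a hypothetical sketch of that idea, not the paper's published metric.

```python
def novel_ngram_fraction(summary: str, source: str, n: int = 2) -> float:
    """Fraction of summary n-grams absent from the source text.
    Higher values indicate a more abstractive summary (illustrative
    proxy, not the paper's exact measure)."""
    def ngrams(text: str) -> set:
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    summ = ngrams(summary)
    if not summ:
        return 0.0
    return len(summ - ngrams(source)) / len(summ)

# A summary copied verbatim from the source has no novel bigrams:
print(novel_ngram_fraction("deep models summarize text",
                           "deep models summarize text well"))  # → 0.0
```

Under a measure like this, the paper's finding that open-weight LLMs lean extractive would show up as systematically lower novel-n-gram fractions for model outputs than for author-written TLDRs.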
Zhuoqi Lyu
Department of Data Science, City University of Hong Kong, Hong Kong, China
Qing Ke
City University of Hong Kong
Data Science · Innovation · Complex Systems · Cheminformatics