Lessons from the TREC Plain Language Adaptation of Biomedical Abstracts (PLABA) track

📅 2025-07-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of adapting professional biomedical literature into plain language for patients and caregivers. The authors hosted the PLABA track at TREC 2023 and 2024, with tasks covering complete sentence-level rewriting of abstracts (Task 1) and identification and replacement of difficult terms (Task 2), and they built a four-fold set of professionally written references for automatic evaluation alongside extensive manual judgments from biomedical experts. Twelve teams from twelve countries submitted systems ranging from multilayer perceptrons to large pretrained transformers. Top-performing systems approached human levels of factual accuracy and completeness but fell short on simplicity and brevity, and reference-based automatic metrics generally did not correlate well with the manual judgments. In term replacement, LLM-based systems produced accurate, complete, and simple substitutions but struggled to identify difficult terms and to classify how they should be replaced. The track demonstrates the promise of LLMs for public-facing medical text simplification while underscoring their deficiencies and the need for better automatic benchmarking tools.

📝 Abstract
Objective: Recent advances in language models have shown potential to adapt professional-facing biomedical literature to plain language, making it accessible to patients and caregivers. However, their unpredictability, combined with the high potential for harm in this domain, means rigorous evaluation is necessary. Our goals with this track were to stimulate research and to provide high-quality evaluation of the most promising systems. Methods: We hosted the Plain Language Adaptation of Biomedical Abstracts (PLABA) track at the 2023 and 2024 Text Retrieval Conferences. Tasks included complete, sentence-level rewriting of abstracts (Task 1) as well as identifying and replacing difficult terms (Task 2). For automatic evaluation of Task 1, we developed a four-fold set of professionally-written references. Submissions for both Tasks 1 and 2 received extensive manual evaluation from biomedical experts. Results: Twelve teams spanning twelve countries participated in the track, with models from multilayer perceptrons to large pretrained transformers. In manual judgments of Task 1, top-performing models rivaled human levels of factual accuracy and completeness, but not simplicity or brevity. Automatic, reference-based metrics generally did not correlate well with manual judgments. In Task 2, systems struggled with identifying difficult terms and classifying how to replace them. When generating replacements, however, LLM-based systems did well in manually judged accuracy, completeness, and simplicity, though not in brevity. Conclusion: The PLABA track showed promise for using Large Language Models to adapt biomedical literature for the general public, while also highlighting their deficiencies and the need for improved automatic benchmarking tools.
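A central finding is that reference-based automatic metrics did not track expert judgments well. A minimal way to probe this is a rank correlation between per-sentence metric scores and human ratings. The sketch below is illustrative only: the token-overlap metric is a stand-in for whatever metric is being tested (it is not SARI, BLEU, or the track's evaluation code), and the sentences and ratings are invented for demonstration.

```python
# Hypothetical sketch: correlate a reference-based automatic metric with
# expert judgments. The overlap metric and all data below are illustrative
# placeholders, not the PLABA track's metrics or scores.
from scipy.stats import spearmanr


def overlap_score(candidate: str, references: list[str]) -> float:
    """Best token-level Jaccard overlap against a set of references
    (a stand-in for a real multi-reference metric)."""
    cand = set(candidate.lower().split())
    best = 0.0
    for ref in references:
        ref_tokens = set(ref.lower().split())
        union = cand | ref_tokens
        if union:
            best = max(best, len(cand & ref_tokens) / len(union))
    return best


# Invented example: one adapted sentence per row, four references each
# (mirroring the track's four-fold reference set), plus an expert rating.
system_outputs = [
    "Doctors looked at how well the drug worked.",
    "The trial evaluated pharmacokinetic parameters in subjects.",
    "Patients felt better after taking the medicine.",
]
reference_sets = [
    ["Researchers studied how well the drug worked.",
     "The study checked whether the drug helped patients.",
     "Scientists tested if the treatment was effective.",
     "They looked at how effective the drug was."],
    ["The trial measured how the body absorbs the drug.",
     "Researchers checked how the drug moves through the body.",
     "The study looked at drug levels in the blood.",
     "They measured how fast the drug is processed."],
    ["Patients reported feeling better after the medicine.",
     "People said the medicine helped them feel better.",
     "Patients felt better once they took the drug.",
     "The medicine made patients feel better."],
]
human_simplicity = [4, 1, 5]  # invented expert ratings on a 1-5 scale

auto_scores = [overlap_score(out, refs)
               for out, refs in zip(system_outputs, reference_sets)]
rho, p_value = spearmanr(auto_scores, human_simplicity)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```

On real track data, a low or unstable correlation at the sentence level is the kind of result that motivates the paper's call for better automatic benchmarking tools.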
Problem

Research questions and friction points this paper is trying to address.

Adapt biomedical literature to plain language for patients
Evaluate language models for factual accuracy and simplicity
Improve automatic benchmarking tools for biomedical text adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmarked systems ranging from multilayer perceptrons to large pretrained transformers (a term-replacement pipeline sketch follows this list)
Developed a four-fold, professionally written reference set for automatic evaluation
Combined reference-based automatic metrics with expert manual evaluation
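As a rough illustration of the Task 2 structure (identify difficult terms, decide how to handle each one, then generate a replacement), here is a hypothetical sketch. The `call_llm` function, the prompts, and the action labels are assumptions made for illustration; they do not correspond to any participating system or to the track's actual label set.

```python
# Hypothetical sketch of a two-step identify-then-replace pipeline for
# difficult biomedical terms. `call_llm` is a placeholder for whatever
# chat/completions client a system might use; prompts and labels are
# illustrative only.
from typing import Callable

# Possible ways to handle a difficult term; the track's real label set may differ.
ACTIONS = ("SUBSTITUTE", "EXPLAIN", "GENERALIZE", "OMIT", "EXEMPLIFY")


def adapt_sentence(sentence: str, call_llm: Callable[[str], str]) -> str:
    # Step 1: ask the model to list terms a lay reader might not understand.
    terms_raw = call_llm(
        "List the technical terms in this sentence that a patient might not "
        f"understand, one per line:\n{sentence}"
    )
    terms = [t.strip() for t in terms_raw.splitlines() if t.strip()]

    adapted = sentence
    for term in terms:
        # Step 2: decide how the term should be handled.
        action = call_llm(
            f"For the term '{term}' in the sentence below, answer with one of "
            f"{', '.join(ACTIONS)}.\n{adapted}"
        ).strip().upper()

        # Step 3: generate a plain-language replacement and apply it.
        if action == "OMIT":
            adapted = adapted.replace(term, "").replace("  ", " ").strip()
        else:
            replacement = call_llm(
                f"Rewrite the term '{term}' in plain language for a patient, "
                f"keeping the meaning of this sentence:\n{adapted}"
            ).strip()
            adapted = adapted.replace(term, replacement)
    return adapted
```

The track's manual judgments suggest that the first two steps, identifying difficult terms and classifying how to replace them, are where systems were weakest, while the generation step itself was handled well by LLM-based systems.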