🤖 AI Summary
This study addresses the core challenge of large-scale biomedical text simplification, namely balancing readability against semantic fidelity, to improve public access to health information. We propose a large language model (LLM)-driven simplification framework based on instruction tuning and, for the first time, systematically validate the architectural advantage of Mistral 24B for this task: it significantly outperforms Qwen2.5 32B on simplification quality, reaching a mean SARI of 42.46 and a BERTScore of 0.91 (vs. 0.89) and achieving near-human semantic consistency, as measured on an evaluation suite of 21 quantitative metrics spanning readability, faithfulness, and safety. Our work establishes a reproducible technical pipeline and a comprehensive evaluation benchmark for high-fidelity, lay-audience-oriented biomedical text generation.
📝 Abstract
The general public's growing health-seeking behavior and digital consumption of biomedical information call for scalable methods that automatically adapt complex scientific and technical documents into plain language. However, automatic text simplification systems, including advanced large language models (LLMs), still struggle to reliably balance gains in readability against preservation of discourse fidelity. This report empirically assesses two major classes of general-purpose LLMs, gauging their linguistic capabilities and readiness for the task against a human benchmark. Through a comparative analysis of the instruction-tuned Mistral 24B and the reasoning-augmented Qwen2.5 32B, we identify a potential architectural advantage of the instruction-tuned model. Mistral exhibits a tempered lexical simplification strategy that improves readability across a suite of standard metrics and the simplification-specific SARI score (mean 42.46), while preserving human-level discourse fidelity with a BERTScore of 0.91. Qwen also improves readability, but its operational strategy shows a disconnect in balancing readability against accuracy, yielding a statistically significantly lower BERTScore of 0.89. In addition, a comprehensive correlation analysis of 21 metrics, spanning readability, discourse fidelity, content safety, and underlying distributional measures used for mechanistic insight, confirms strong functional redundancy among five readability indices. This empirical evidence establishes baseline performance of evolving LLMs on text simplification, identifies the instruction-tuned Mistral 24B as the stronger simplifier, offers practical heuristics for metric selection, and points to lexical support as a primary domain-adaptation issue for simplification.
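The readability indices whose redundancy the correlation analysis confirms are typically formula-based surface measures over sentence and word counts. As an illustration only (not the study's implementation), one such index, Flesch Reading Ease, can be sketched as follows; the syllable counter is a naive vowel-group heuristic introduced here for the sketch:

```python
import re

def count_syllables(word: str) -> int:
    """Naive syllable estimate: count vowel groups, drop a trailing silent 'e'."""
    word = word.lower()
    groups = re.findall(r"[aeiouy]+", word)
    n = len(groups)
    if word.endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: higher scores indicate easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))
```

Comparing a plain sentence with jargon-heavy biomedical prose reproduces the expected ordering, which is why several such indices tend to correlate strongly: they all reward shorter sentences and shorter words.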