An In-depth Evaluation of Large Language Models in Sentence Simplification with Error-based Human Assessment

📅 2024-03-08
📈 Citations: 2
Influential: 0
🤖 AI Summary
Current LLM-based sentence simplification evaluation suffers from two key limitations: (1) automatic metrics (e.g., BLEU, SARI) are insufficiently sensitive to high-performing models, and (2) human evaluation is either superficial or overly labor-intensive, compromising reliability. To address these issues, we propose an error-type-driven, fine-grained human annotation framework for the systematic evaluation of models including GPT-4, Qwen2.5-72B, and Llama-3.2-3B. The methodology comprises a principled error taxonomy, multi-model comparative experiments, inter-annotator consistency control, and meta-evaluation of automatic metrics against human judgments. Results reveal that while the evaluated LLMs achieve lower error rates than prior state-of-the-art systems, GPT-4 and Qwen2.5-72B still show clear weaknesses in lexical paraphrasing. Moreover, mainstream automatic metrics align poorly with human assessments. This work establishes a more reliable, interpretable, and actionable evaluation paradigm for LLM sentence simplification capabilities.
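To make the meta-evaluation idea above concrete, the sketch below checks how well an automatic metric tracks human error annotations; all scores and error counts are hypothetical placeholders, not the paper's data. A metric that is sensitive to quality should correlate negatively with the number of errors annotators find.

```python
# A minimal sketch of metric-vs-human meta-evaluation. All numbers are
# hypothetical placeholders, not the paper's data.
from scipy.stats import spearmanr

# Per-output automatic metric scores (e.g., SARI) ...
metric_scores = [42.1, 38.7, 45.3, 40.2, 36.9]
# ... and the number of errors human annotators marked in each output.
human_errors = [0, 2, 0, 1, 3]

# A sensitive metric should score low-error outputs higher, i.e. correlate
# negatively with error counts; a weak correlation signals low sensitivity.
rho, p_value = spearmanr(metric_scores, human_errors)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```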

📝 Abstract
Recent studies have used both automatic metrics and human evaluations to assess the simplification abilities of LLMs. However, the suitability of existing evaluation methodologies for LLMs remains in question. First, it is uncertain whether current automatic metrics are suitable for evaluating LLMs' simplifications. Second, current human evaluation approaches in sentence simplification often fall into two extremes: they are either too superficial, failing to offer a clear understanding of the models' performance, or overly detailed, making the annotation process complex and prone to inconsistency, which in turn affects the evaluation's reliability. To address these problems, this study provides in-depth insights into LLMs' performance while ensuring the reliability of the evaluation. We design an error-based human annotation framework to assess the LLMs' simplification capabilities. We select both closed-source and open-source LLMs, including GPT-4, Qwen2.5-72B, and Llama-3.2-3B, which we believe offer a representative selection across large, medium, and small LLM sizes. Results show that LLMs generally generate fewer erroneous simplification outputs than the previous state-of-the-art. However, LLMs have their limitations, as seen in GPT-4's and Qwen2.5-72B's struggles with lexical paraphrasing. Furthermore, we conduct meta-evaluations of widely used automatic metrics using our human annotations. We find that these metrics lack sufficient sensitivity to assess overall high-quality simplifications, particularly those generated by high-performing LLMs.
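As background for the metrics discussion above, SARI (one of the automatic metrics named in the summary) can be computed with the Hugging Face `evaluate` package; the snippet below is a generic usage sketch with toy sentences, assuming that package is installed, and is not the paper's evaluation setup.

```python
# A usage sketch for the SARI metric via Hugging Face's `evaluate` package
# (pip install evaluate). Sentences are toy examples, not the paper's data.
import evaluate

sari = evaluate.load("sari")
result = sari.compute(
    sources=["About 95 species are currently accepted."],
    predictions=["About 95 species are currently known."],
    references=[[
        "About 95 species are currently known.",
        "About 95 species are now accepted.",
    ]],
)
print(result)  # e.g., {'sari': ...} -- higher is better
```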
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' sentence simplification abilities with reliable methods
Assessing suitability of automatic metrics for LLM evaluation
Developing error-based human annotation for consistent assessment
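The consistency concern in the last point above is typically quantified with an inter-annotator agreement statistic. Below is a minimal sketch using Cohen's kappa; the labels are made-up placeholders, not the error taxonomy the paper defines.

```python
# A sketch of an inter-annotator agreement check with Cohen's kappa.
# The labels below are illustrative placeholders, not the paper's taxonomy.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["no_error", "lexical", "no_error", "deletion", "lexical"]
annotator_b = ["no_error", "lexical", "deletion", "deletion", "lexical"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa = {kappa:.2f}")  # closer to 1 = stronger agreement
```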
Innovation

Methods, ideas, or system contributions that make the work stand out.

Error-based human annotation framework for LLMs (see the record sketch after this list)
Evaluation of diverse LLMs including GPT-4
Meta-evaluation of automatic metrics' sensitivity
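One way such an error-based annotation framework could be represented in code is sketched below; the field names and error types are assumptions made for illustration, not the annotation scheme the paper actually defines.

```python
# Illustrative only: a possible record type for one error annotation.
# Field names and error types are assumed for this sketch, not taken from
# the paper's actual annotation scheme.
from dataclasses import dataclass

@dataclass
class ErrorAnnotation:
    source: str            # original sentence
    simplification: str    # model output being judged
    error_type: str        # e.g., "lexical", "hallucination", "deletion"
    severity: int          # e.g., 1 (minor) to 3 (critical)
    span: tuple            # (start, end) character offsets of the error

record = ErrorAnnotation(
    source="The committee promulgated the new regulations.",
    simplification="The committee made the new rules public.",
    error_type="lexical",
    severity=1,
    span=(14, 25),
)
print(record.error_type, record.span)
```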
Xuanxin Wu
Graduate School of Information Science and Technology, Osaka University, Japan
Yuki Arase
Graduate School of Information Science and Technology, Osaka University, Japan