🤖 AI Summary
This study addresses the tension between public readability and professional accuracy in German legal document summarization. We introduce the first benchmark dataset for public-oriented German judicial summarization (6.4k judgments paired with official press releases) and structure each example as a triple (judgment text, human-written press release, synthetic prompt) to support citizen-centered generation. A multidimensional evaluation framework integrates factual-consistency verification, LLM-as-judge scoring, expert ranking, and conventional metrics (ROUGE/BERTScore). Methodologically, we adopt a hierarchical summarization strategy leveraging both small and large language models. Experiments show that large-model outputs approach human quality; small models, when hierarchically optimized, handle long texts markedly better; and human-written press releases remain the strongest baseline. Our work fills critical gaps in legal NLP research concerning readability, accessibility, and civic communication.
📝 Abstract
Official court press releases from Germany's highest courts present and explain judicial rulings to the public as well as to expert audiences. Prior NLP efforts emphasize technical headnotes, ignoring citizen-oriented communication needs. We introduce CourtPressGER, a dataset of 6.4k triples: rulings, human-drafted press releases, and synthetic prompts that instruct LLMs to generate comparable releases. This benchmark supports training and evaluating LLMs on generating accurate, readable summaries of long judicial texts. We benchmark small and large LLMs using reference-based metrics, factual-consistency checks, LLM-as-judge scoring, and expert ranking. Large LLMs produce high-quality drafts with minimal performance loss in hierarchical setups; smaller models require hierarchical setups to process long judgments. Initial benchmarks show varying model performance, with human-drafted releases ranking highest.
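The triple structure and the hierarchical strategy for long judgments can be sketched as a simple map-reduce over text chunks. Everything below (field names, character-based chunking, the injected `summarize` callable) is an illustrative assumption, not the dataset's actual schema or the paper's implementation:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class CourtPressTriple:
    # Hypothetical record layout mirroring the described triples.
    judgment: str        # full court ruling text
    press_release: str   # human-drafted reference summary
    prompt: str          # synthetic instruction for an LLM

def chunk(text: str, max_chars: int) -> List[str]:
    """Split a long judgment into fixed-size character chunks."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def hierarchical_summary(judgment: str,
                         summarize: Callable[[str], str],
                         max_chars: int = 2000) -> str:
    """Two-stage summarization: summarize each chunk independently,
    then summarize the concatenation of the chunk summaries."""
    part_summaries = [summarize(c) for c in chunk(judgment, max_chars)]
    return summarize(" ".join(part_summaries))
```

In practice `summarize` would wrap an LLM call; the two-stage design is what lets small-context models cover judgments that exceed their input window.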