🤖 AI Summary
This study addresses the tension between public readability and professional accuracy in German legal document summarization. We introduce the first benchmark dataset for public-oriented German judicial summarization (6.4k judgments paired with official press releases) and structure each example as a triple (judgment text, human-written press release, synthetic prompt) to support citizen-centered generation. A multidimensional evaluation framework integrates factual-consistency verification, LLM-as-judge scoring, expert ranking, and conventional metrics (ROUGE/BERTScore). Methodologically, we adopt a hierarchical summarization strategy leveraging both small and large language models. Experiments show that large-model outputs approach human quality; small models, when hierarchically optimized, handle long texts markedly better; and human-written press releases remain the strongest baseline. Our work fills critical gaps in legal NLP research concerning readability, accessibility, and civic communication.
📝 Abstract
Official court press releases from Germany's highest courts present and explain judicial rulings to the public as well as to expert audiences. Prior NLP efforts emphasize technical headnotes, ignoring citizen-oriented communication needs. We introduce CourtPressGER, a dataset of 6.4k triples: rulings, human-drafted press releases, and synthetic prompts that instruct LLMs to generate comparable releases. This benchmark supports training and evaluating LLMs on generating accurate, readable summaries of long judicial texts. We benchmark small and large LLMs using reference-based metrics, factual-consistency checks, LLM-as-judge scoring, and expert ranking. Large LLMs produce high-quality drafts with minimal performance loss in hierarchical setups; smaller models require hierarchical setups to process long judgments. Initial benchmarks show varying model performance, with human-drafted releases ranking highest.
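The triple structure and the hierarchical strategy for long judgments can be sketched as a simple map-reduce over text chunks. Everything below (field names, character-based chunking, the injected `summarize` callable) is an illustrative assumption, not the dataset's actual schema or the paper's implementation:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class CourtPressTriple:
    # Hypothetical record layout mirroring the described triples.
    judgment: str        # full court ruling text
    press_release: str   # human-drafted reference summary
    prompt: str          # synthetic instruction for an LLM

def chunk(text: str, max_chars: int) -> List[str]:
    """Split a long judgment into fixed-size character chunks."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def hierarchical_summary(judgment: str,
                         summarize: Callable[[str], str],
                         max_chars: int = 2000) -> str:
    """Two-stage summarization: summarize each chunk independently,
    then summarize the concatenation of the chunk summaries."""
    part_summaries = [summarize(c) for c in chunk(judgment, max_chars)]
    return summarize(" ".join(part_summaries))
```

In practice `summarize` would wrap an LLM call; the two-stage design is what lets small-context models cover judgments that exceed their input window.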