MÖVE: A Holistic LLM Benchmark for the German Public Sector

📅 2026-06-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the lack of comprehensive evaluation frameworks for large language models (LLMs) in the German public sector: existing benchmarks are predominantly English-language and US-centric in content, and they focus on task performance while neglecting governance considerations. To bridge this gap, the authors propose MÖVE, a benchmark that evaluates models along two complementary dimensions, performance and governance. Performance criteria cover summarization, question answering, and topic extraction. Governance criteria cover hallucination tendencies, energy consumption, provider transparency, alignment with German constitutional values, and knowledge of German political parties' positions. Using ten German-language datasets, including newly constructed gold- and silver-standard datasets that reflect public-administration domains, the study assesses 39 models with classical NLP metrics, embedding-based methods, and LLM-as-a-judge approaches. Results show that no single model dominates across all criteria, and model size alone is a poor predictor of quality. The authors also evaluate the benchmark itself, analyzing its statistical precision, LLM judge reliability, the impact of the private datasets on model rankings, sensitivity to prompt formulation, and the validity of the energy consumption estimates.

📝 Abstract

We present MÖVE (Modelle für die Öffentliche Verwaltung Evaluieren), a holistic benchmark for evaluating large language models (LLMs) in the context of the German public sector. While LLMs are increasingly adopted in public administration, model selection remains largely ad hoc, and existing benchmarks offer limited guidance: they are predominantly English-centric, US-centric in content, and focus exclusively on task performance. MÖVE addresses these gaps by evaluating 39 models across two complementary dimensions. Performance criteria cover summarization, question answering, and topic extraction. Governance criteria assess hallucination tendencies, energy consumption, provider transparency, alignment with German constitutional values, and knowledge of the positions of German political parties. In total, we utilize ten German-language datasets, including gold- and silver-standard datasets that we constructed to reflect public-administration domains. We employ a multi-metric evaluation strategy combining classical NLP metrics, embedding-based methods, and LLM-as-a-judge approaches. Our results show that no single model dominates across all criteria: top performers differ between tasks, and model size alone is a poor predictor of quality. We further evaluate the benchmark itself, analyzing its statistical precision, LLM judge reliability, the impact of our private datasets on model rankings, the sensitivity of our results to prompt formulation, and the validity of our energy consumption estimates. MÖVE is designed as a living benchmark under active development; results are publicly available at https://moeve.bundesdruckerei.de/.

Problem

Research questions and friction points this paper is trying to address.

large language models

public sector

benchmark

German language

model evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

holistic benchmark

public sector LLM evaluation

German-language NLP