WritingBench: A Comprehensive Benchmark for Generative Writing

📅 2025-03-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks inadequately assess large language models' (LLMs) capabilities in high-quality, cross-domain generative writing. To address this, we introduce the first comprehensive generative writing benchmark, covering six broad domains and 100 fine-grained subdomains. We propose a query-dependent dynamic evaluation framework: (i) instance-level scoring criteria are generated via dynamic prompting; (ii) a criteria-aware fine-tuned critic model enables fine-grained assessment of style, formatting, and length constraints. Experiments demonstrate that the framework's data curation capability lets a 7B-parameter model achieve performance competitive with state-of-the-art (SOTA) larger models. All benchmark data, evaluation tools, and framework components are fully open-sourced. This work advances writing evaluation along three dimensions: domain coverage, task adaptability, and scoring consistency, establishing a more rigorous, scalable, and generalizable foundation for assessing LLM-generated text across diverse writing tasks.

📝 Abstract
Recent advancements in large language models (LLMs) have significantly enhanced text generation capabilities, yet evaluating their performance in generative writing remains a challenge. Existing benchmarks primarily focus on generic text generation or are limited to narrow writing tasks, failing to capture the diverse requirements of high-quality written content across various domains. To bridge this gap, we present WritingBench, a comprehensive benchmark designed to evaluate LLMs across 6 core writing domains and 100 subdomains, encompassing creative, persuasive, informative, and technical writing. We further propose a query-dependent evaluation framework that empowers LLMs to dynamically generate instance-specific assessment criteria. This framework is complemented by a fine-tuned critic model for criteria-aware scoring, enabling evaluation of style, format, and length. The framework's validity is further demonstrated by its data curation capability, which enables 7B-parameter models to approach state-of-the-art (SOTA) performance. We open-source the benchmark, along with evaluation tools and modular framework components, to advance the development of LLMs in writing.
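The query-dependent evaluation described in the summary and abstract amounts to a two-stage pipeline: an LLM first derives instance-specific criteria from the writing query via dynamic prompting, and a critic model then scores the candidate response against each criterion. The sketch below illustrates that flow under stated assumptions; the prompt templates, the `query_llm` and `critic_llm` callables, and the 1-10 scale are hypothetical illustrations, not the authors' released implementation.

```python
# Minimal sketch of a query-dependent evaluation pipeline in the spirit of
# WritingBench: criteria are generated per query, then a criteria-aware critic
# scores the response on each criterion. All names, prompts, and the scoring
# scale are illustrative assumptions, not the paper's released code.
import json
from statistics import mean

CRITERIA_PROMPT = """Given the writing task below, list 5 assessment criteria
as a JSON array of objects with "name" and "description" fields.
Task: {query}"""

SCORING_PROMPT = """Score the response on a 1-10 scale for the criterion.
Criterion: {name} - {description}
Task: {query}
Response: {response}
Reply with only the integer score."""

def generate_criteria(query_llm, query):
    """Stage 1: dynamic prompting yields instance-level scoring criteria."""
    raw = query_llm(CRITERIA_PROMPT.format(query=query))
    return json.loads(raw)  # e.g. [{"name": "Tone", "description": "..."}]

def score_response(critic_llm, query, response, criteria):
    """Stage 2: the critic scores each generated criterion independently."""
    scores = {}
    for c in criteria:
        raw = critic_llm(SCORING_PROMPT.format(
            name=c["name"], description=c["description"],
            query=query, response=response))
        scores[c["name"]] = int(raw.strip())
    return scores, mean(scores.values())
```

Averaging per-criterion scores into a final instance score is one plausible aggregation; per the abstract, the fine-tuned critic model fills the stage-2 role that a general-purpose judge LLM would otherwise play.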
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs in diverse generative writing tasks
Addressing gaps in existing benchmarks for writing quality
Developing a dynamic, domain-specific evaluation framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensive benchmark for 6 writing domains
Query-dependent evaluation framework for LLMs
Fine-tuned critic model for criteria-aware scoring
Yuning Wu
Wayne State University
perceptions of crime & justice, police attitudes and behaviors, victimization, criminological theories, law and society
Jiahao Mei
Alibaba Group, Shanghai Jiao Tong University
Ming Yan
Alibaba Group
Chenliang Li
Alibaba Group
Shaopeng Lai
Alibaba Group
Yuran Ren
Renmin University of China
Zijia Wang
Renmin University of China
Ji Zhang
Alibaba Group
Mengyue Wu
Shanghai Jiao Tong University
speech perception and production, affective computing, audio cognition
Qin Jin
School of Information, Renmin University of China
artificial intelligence
Fei Huang
Alibaba Group