CreBench: Human-Aligned Creativity Evaluation from Idea to Process to Product

📅 2025-11-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal large language models (MLLMs) struggle to align with human judgments of creativity, primarily due to the abstract nature of creativity definitions and the absence of comprehensive, multidimensional evaluation benchmarks. Method: We introduce CreBench—the first human-aligned creativity assessment benchmark covering the full creative pipeline (idea generation, process reasoning, and output evaluation)—and the high-quality multimodal instruction tuning dataset CreMIT, comprising 2.2K diverse samples, 79.2K fine-grained human annotations, and 4.7M instruction-response pairs. Leveraging GPT-assisted reconstruction of human feedback, we perform large-scale multimodal instruction tuning to train CreExpert. Contribution/Results: CreExpert significantly outperforms state-of-the-art models—including GPT-4V and Gemini-Pro-Vision—across multiple creativity evaluation tasks. It achieves markedly improved agreement with human judgments, establishing the first end-to-end, human-aligned computational framework for both creative generation and assessment.

📝 Abstract
Human-defined creativity is highly abstract, making it challenging for multimodal large language models (MLLMs) to comprehend and assess creativity in a way that aligns with human judgments. The absence of an existing benchmark further exacerbates this dilemma. To this end, we propose CreBench, which consists of two key components: 1) an evaluation benchmark covering multiple dimensions from creative idea to process to product; 2) CreMIT (Creativity Multimodal Instruction Tuning dataset), a multimodal creativity evaluation dataset comprising 2.2K multimodal samples from diverse sources, 79.2K human feedback entries, and 4.7M instructions of multiple types. Specifically, to ensure MLLMs can handle diverse creativity-related queries, we prompt GPT to refine this human feedback, activating stronger creativity assessment capabilities. CreBench serves as a foundation for building MLLMs that understand human-aligned creativity. Based on CreBench, we fine-tune open-source general MLLMs, resulting in CreExpert, a multimodal creativity evaluation expert model. Extensive experiments demonstrate that the proposed CreExpert models achieve significantly better alignment with human creativity evaluation than state-of-the-art MLLMs, including the most advanced GPT-4V and Gemini-Pro-Vision.
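The dataset-construction step described above, prompting GPT to rewrite raw human feedback into instruction-style training data, can be sketched roughly as below. The prompt template, the `build_refinement_prompt` helper, and the record shape are illustrative assumptions for exposition, not the authors' actual CreMIT pipeline.

```python
# Hypothetical sketch of the feedback-refinement step: raw human
# annotations are rewritten by an LLM into instruction-response records
# suitable for instruction tuning. Template and data shapes are assumed.

REFINE_PROMPT = (
    "You are building a creativity-evaluation dataset.\n"
    "Rewrite the raw human feedback below into a question-answer pair\n"
    "about the work's creativity. Keep the original judgment faithful.\n\n"
    "Raw feedback: {feedback}"
)

def build_refinement_prompt(feedback: str) -> str:
    """Fill the refinement template with one human annotation."""
    return REFINE_PROMPT.format(feedback=feedback.strip())

def to_instruction_pairs(feedbacks, llm_call):
    """Map each raw annotation to an instruction-tuning record.

    `llm_call` is any callable that takes a prompt string and returns
    the model's text (e.g. a thin wrapper around a chat-completion API).
    """
    records = []
    for fb in feedbacks:
        refined = llm_call(build_refinement_prompt(fb))
        records.append({"instruction": refined, "source_feedback": fb})
    return records
```

With a real API wrapper passed as `llm_call`, each raw annotation yields one refined training record while the original feedback is retained for traceability.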
Problem

Research questions and friction points this paper is trying to address.

MLLMs struggle to assess human-aligned creativity because creativity is abstractly defined
No benchmark exists for multimodal creativity evaluation
Current models show poor alignment with human creativity judgments
Innovation

Methods, ideas, or system contributions that make the work stand out.

CreBench benchmark covers dimensions from creative idea to process to product
CreMIT dataset pairs diverse multimodal samples with fine-grained human feedback
Fine-tuned CreExpert model aligns closely with human creativity evaluation