🤖 AI Summary
Despite rapid advances in 3D generation, the field lacks human-perception-aligned automatic evaluation metrics and large-scale, multidimensional human preference datasets for benchmarking. Method: We introduce 3DGen-Bench, the first open-source, large-scale human preference dataset for 3D generative models, featuring diverse text- and image-to-3D prompts with preference annotations collected from both expert annotators and public users via a battle-style arena platform (3DGen-Arena). To enable unified, quantitative assessment, we propose a dual-engine automatic evaluation framework: (i) 3DGen-Score, a CLIP-based scoring model fine-tuned on the collected preferences; and (ii) 3DGen-Eval, an MLLM-based evaluator; both handle text-to-3D and image-to-3D generation in a unified manner. Contribution/Results: Experiments show that 3DGen-Score correlates with human rankings significantly better than existing metrics, establishing a foundation for fair, standardized, and perception-grounded evaluation of 3D generative models.
📝 Abstract
3D generation is advancing rapidly, but the development of 3D evaluation has not kept pace, and keeping automatic evaluation aligned with human perception has become a well-recognized challenge. Recent work on language and image generation has collected human preferences at scale and shown that learned models can fit them well; the 3D domain, however, still lacks such a comprehensive preference dataset for generative models. To fill this gap, we develop 3DGen-Arena, an integrated platform that evaluates models through pairwise battles. We then carefully design diverse text and image prompts and use the arena to gather human preferences from both public users and expert annotators, yielding 3DGen-Bench, a large-scale, multi-dimensional human preference dataset. Using this dataset, we further train a CLIP-based scoring model, 3DGen-Score, and an MLLM-based automatic evaluator, 3DGen-Eval. The two models unify the quality evaluation of text-to-3D and image-to-3D generation and, with their complementary strengths, jointly form our automated evaluation system. Extensive experiments demonstrate that our scoring model predicts human preferences effectively, correlating with human rankings better than existing metrics. We believe the 3DGen-Bench dataset and automated evaluation system will foster fairer evaluation in 3D generation and further promote the development of 3D generative models and their downstream applications.
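The abstract does not detail how 3DGen-Score is trained, but battle-style preference data pairs naturally with a Bradley-Terry pairwise objective, the standard choice in image reward models such as ImageReward and HPS. The sketch below illustrates one plausible setup under stated assumptions: a generated 3D asset is scored by the mean CLIP similarity between the prompt and its rendered views, and the winner of each arena battle is pushed above the loser. The `open_clip` backbone choice, the `score` and `preference_loss` helpers, and the multi-view scoring scheme are all illustrative assumptions, not the paper's actual recipe.

```python
# Minimal sketch of pairwise preference fine-tuning for a CLIP-style 3D scorer.
# Assumptions (not from the abstract): each 3D asset is represented by a batch
# of rendered views, the score is a prompt-render cosine similarity, and
# training uses a Bradley-Terry loss on arena battle outcomes.
import torch
import torch.nn.functional as F
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="openai"
)  # preprocess would be applied to real renders before scoring
tokenizer = open_clip.get_tokenizer("ViT-L-14")


def score(prompt_tokens, views):
    """Score one asset: mean cosine similarity between the prompt embedding
    and the embeddings of its rendered views, shape (V, 3, H, W)."""
    txt = F.normalize(model.encode_text(prompt_tokens), dim=-1)  # (1, D)
    img = F.normalize(model.encode_image(views), dim=-1)         # (V, D)
    return (img @ txt.T).mean()                                  # scalar


def preference_loss(prompt_tokens, views_a, views_b, a_wins):
    """Bradley-Terry pairwise loss: push the winner's score above the loser's."""
    s_a = score(prompt_tokens, views_a)
    s_b = score(prompt_tokens, views_b)
    margin = s_a - s_b if a_wins else s_b - s_a
    return -F.logsigmoid(margin)


# Example: one training step on a single (hypothetical) battle record,
# using random tensors in place of preprocessed renders.
prompt = tokenizer(["a wooden rocking chair"])
views_a = torch.randn(4, 3, 224, 224)  # 4 rendered views of asset A
views_b = torch.randn(4, 3, 224, 224)  # 4 rendered views of asset B
loss = preference_loss(prompt, views_a, views_b, a_wins=True)
loss.backward()
```

The same pairwise objective could extend to image-conditioned prompts by encoding the condition image with CLIP's image encoder instead of the text encoder, which is one way a single scorer might cover both text-to-3D and image-to-3D inputs as the paper describes.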