🤖 AI Summary
This work addresses the lack of systematic evaluation of fine-grained elements, scientific reasoning ability, and output conciseness when current text-to-image (T2I) models generate natural science illustrations. To this end, the authors propose FEPBench, described as the first fine-grained benchmark designed specifically for natural science illustrations, which evaluates models along three dimensions: instruction faithfulness, reasoning enrichment, and semantic precision. The benchmark introduces an atom set annotation framework covering visual, textual, relation, and layout elements. Quantitative evaluation is made efficient through multimodal large language model (MLLM)-assisted annotation, expert validation, and structured parsing. Experimental results reveal that even state-of-the-art closed-source models suffer from text-rendering errors, limited reasoning enrichment, and a trade-off between generation richness and precision, which points to clear directions for future model improvement.
📝 Abstract
Scientific illustrations are essential tools for communicating research findings, especially in natural science, where they visualize complex concepts and processes. As Text-to-Image (T2I) models become increasingly capable, researchers have started to use them for scientific illustration generation. However, existing benchmarks often assess outputs at a holistic level, overlooking fine-grained elements, while scientific reasoning ability and output conciseness remain under-quantified. We introduce FEPBench, a benchmark built from carefully selected high-quality scientific illustrations across multiple disciplines and layout types. With the assistance of multimodal large language models (MLLMs) and human experts, we provide fine-grained atom set annotations and systematically evaluate T2I models along three dimensions: instruction faithfulness, reasoning enrichment, and semantic precision. Our evaluation further decomposes model performance across visual, textual, relation, and layout elements. Results show that even state-of-the-art (SOTA) closed-source models, such as GPT Image 2 and Nano Banana Pro, still suffer from text-rendering bottlenecks, limited reasoning enrichment, and difficulty balancing generation richness with precision. These findings provide practical guidance for improving and deploying T2I models in scientific illustration generation. We will release the benchmark data, atom set annotations, and evaluation code.