Physics-Based Benchmarking Metrics for Multimodal Synthetic Images

📅 2025-11-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing image generation evaluation metrics such as BLEU, CIDEr, and CLIPScore have limited fidelity in domain-specific and context-sensitive scenarios and fail to capture semantic plausibility and structural-physical consistency. To address this, the authors propose a physics-aware multimodal evaluation framework with three stages. It first extracts spatial and semantic features using object detectors and vision-language models, then applies confidence-weighted fusion for adaptive component-level validation, and finally uses physics-guided reasoning with large language models (LLMs) and domain-knowledge mapping to check structural and relational consistency across modalities. According to the authors, this three-stage design discriminates semantic and structural accuracy in synthesized images better than existing metrics, with stronger correlation with human judgment and greater robustness, particularly in specialized domains and complex contextual tasks.

📝 Abstract

Current state-of-the-art metrics such as BLEU, CIDEr, VQA score, SigLIP-2, and CLIPScore often fail to capture semantic or structural accuracy, especially in domain-specific or context-dependent scenarios. To overcome these limitations, this paper proposes a Physics-Constrained Multimodal Data Evaluation (PCMDE) metric that combines reasoning-capable large language models, knowledge-based mapping, and vision-language models (VLMs). The architecture comprises three main stages: (1) extraction of spatial and semantic multimodal features through object detection and VLMs; (2) Confidence-Weighted Component Fusion for adaptive component-level validation; and (3) physics-guided reasoning with large language models to enforce structural and relational constraints (e.g., alignment, position, consistency).
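The abstract does not give the formula behind Confidence-Weighted Component Fusion. As a rough illustration only, one common way to fuse per-component scores is a confidence-weighted average. The sketch below assumes each component (for example an object detector or a VLM) reports a score and a confidence, both in [0, 1]; the names `ComponentScore` and `fuse_scores` are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ComponentScore:
    """Hypothetical container for one evaluator's output (not from the paper)."""
    name: str          # e.g. "detector" or "vlm"
    score: float       # component-level validity score in [0, 1]
    confidence: float  # how much the component trusts its own score, in [0, 1]


def fuse_scores(components: Iterable[ComponentScore]) -> float:
    """Confidence-weighted mean of component scores.

    Assumption: fusion is a normalized weighted average. The paper may use
    a different rule. Returns 0.0 when no component reports any confidence.
    """
    items = list(components)
    total_confidence = sum(c.confidence for c in items)
    if total_confidence == 0:
        return 0.0
    weighted_sum = sum(c.score * c.confidence for c in items)
    return weighted_sum / total_confidence
```

With this rule, a component that reports low confidence contributes little to the fused score, which is one plausible reading of "adaptive component-level validation".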

Problem

Research questions and friction points this paper is trying to address.

Existing metrics fail to capture semantic and structural accuracy in synthetic images

Current evaluation methods struggle with domain-specific and context-dependent scenarios

There is a need for physics-constrained multimodal evaluation of synthetic image quality

Innovation

Methods, ideas, or system contributions that make the work stand out.

Physics-constrained multimodal metric combining LLMs and VLMs

Three-stage architecture with feature extraction and fusion

Physics-guided reasoning for structural constraint enforcement
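To make "structural constraint enforcement" concrete, here is a minimal rule-based sketch of position and alignment checks over detected bounding boxes. It is a simplified stand-in, not the paper's LLM-based reasoning: the `Box` type, the helper names, the pixel tolerance, and the scoring rule are all assumptions made for illustration.

```python
from typing import Iterable, NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box in image coordinates (y grows downward)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def center_x(box: Box) -> float:
    """Horizontal center of a box."""
    return (box.x_min + box.x_max) / 2


def is_above(upper: Box, lower: Box) -> bool:
    """Position check: `upper` ends at or before the top edge of `lower`."""
    return upper.y_max <= lower.y_min


def is_aligned(a: Box, b: Box, tol: float = 5.0) -> bool:
    """Alignment check: horizontal centers differ by at most `tol` pixels."""
    return abs(center_x(a) - center_x(b)) <= tol


def constraint_score(results: Iterable[bool]) -> float:
    """Fraction of satisfied constraints; 1.0 when there is nothing to check."""
    checks = list(results)
    return sum(checks) / len(checks) if checks else 1.0


# Hypothetical scene: a lamp that should sit above a table, centered on it.
lamp = Box(40, 0, 60, 20)
table = Box(0, 30, 100, 80)
score = constraint_score([is_above(lamp, table), is_aligned(lamp, table)])
```

In a pipeline like the one described, checks of this kind could be generated or interpreted by an LLM from domain knowledge rather than hard-coded as they are here.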
