🤖 AI Summary
This study investigates how well large language models (LLMs) capture human-like multimodal perceptual strength, benchmarking model outputs against human cross-sensory (e.g., visual, auditory, tactile) perceptual strength ratings as ground truth.
Method: We introduce the first quantitative benchmark based on perceptual strength ratings, together with a contrastive evaluation framework that combines quantitative correlation analysis with qualitative analysis of error patterns.
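The summary does not say which correlation statistic is used. As a minimal sketch, assuming Spearman's rank correlation computed separately for each sensory modality over words rated by both humans and a model, the quantitative part of the evaluation could look like this. The data layout (`word -> {modality: rating}`) and the helper names are hypothetical, and the toy ratings are invented for illustration, not taken from the study.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks of values, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the tie-averaged ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


def per_modality_spearman(human, model):
    """Hypothetical helper: rho between human and model ratings for each modality.

    human, model: dicts mapping word -> {modality: rating}.
    Only words rated by both sources are compared.
    """
    shared = sorted(human.keys() & model.keys())
    if len(shared) < 2:
        raise ValueError("need at least two words rated by both sources")
    modalities = sorted(human[shared[0]])
    return {
        m: spearman([human[w][m] for w in shared], [model[w][m] for w in shared])
        for m in modalities
    }


if __name__ == "__main__":
    # Invented toy ratings, for illustration only.
    human = {
        "bell": {"auditory": 4.8, "visual": 3.0},
        "lemon": {"auditory": 0.5, "visual": 4.1},
        "silk": {"auditory": 0.2, "visual": 3.5},
    }
    model = {
        "bell": {"auditory": 5.0, "visual": 3.5},
        "lemon": {"auditory": 1.0, "visual": 4.5},
        "silk": {"auditory": 0.8, "visual": 4.6},
    }
    print(per_modality_spearman(human, model))  # {'auditory': 1.0, 'visual': 0.5}
```

A rank-based statistic is one reasonable choice here because it depends only on the ordering of ratings, so a model that is systematically higher or lower than humans on a modality can still score well if it orders words the same way; a Pearson correlation on raw ratings would be an equally plausible alternative.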
Contribution/Results: GPT-4 and GPT-4o significantly outperform GPT-3.5 and GPT-4o-mini, yet GPT-4o does not surpass GPT-4, suggesting that current multimodal integration may not improve embodied perceptual grounding. Models consistently exhibit non-human reasoning patterns, including multisensory overestimation and reliance on superficial semantic associations. This work pioneers the integration of perceptual strength into multimodal LLM evaluation, establishing a new paradigm and a reproducible benchmark for research on semantic grounding and embodied intelligence.
📝 Abstract
This study investigated the multimodal perception of large language models (LLMs), focusing on their ability to capture human-like perceptual strength ratings across sensory modalities. Using perceptual strength ratings as a benchmark, the research compared GPT-3.5, GPT-4, GPT-4o, and GPT-4o-mini, highlighting the influence of multimodal inputs on grounding and linguistic reasoning. While GPT-4 and GPT-4o demonstrated strong alignment with human evaluations and significant improvements over the smaller models, qualitative analyses revealed distinct differences in processing patterns, such as multisensory overrating and reliance on loose semantic associations. Despite its multimodal capabilities, GPT-4o did not exhibit better grounding than GPT-4, raising questions about the role of multimodal capabilities in improving human-like grounding. These findings underscore how LLMs' reliance on linguistic patterns can both approximate and diverge from human embodied cognition, revealing limitations in replicating sensory experience.
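The abstract reports multisensory overrating as a finding of qualitative analysis. One possible quantitative proxy, offered here as an illustrative assumption rather than the authors' method, counts for each word how many modalities receive a rating at or above a cut-off and compares the model's count with the human count. The default threshold of 3.0, the data layout, and the function names are all hypothetical, and the toy ratings are invented.

```python
from statistics import mean


def active_modalities(ratings, threshold):
    """Count modalities whose rating is at or above the threshold."""
    return sum(1 for r in ratings.values() if r >= threshold)


def multisensory_overrating(human, model, threshold=3.0):
    """Hypothetical proxy: mean per-word difference (model minus human) in active modalities.

    human, model: dicts mapping word -> {modality: rating}.
    A positive result means the model, on average, treats words as engaging
    more senses than human raters do. The default threshold is an assumption.
    """
    shared = sorted(human.keys() & model.keys())
    if not shared:
        raise ValueError("no words rated by both sources")
    return mean(
        active_modalities(model[w], threshold) - active_modalities(human[w], threshold)
        for w in shared
    )


if __name__ == "__main__":
    # Invented toy ratings, for illustration only.
    human = {
        "lemon": {"auditory": 0.5, "gustatory": 4.7, "visual": 4.1},
        "bell": {"auditory": 4.8, "gustatory": 0.1, "visual": 3.0},
    }
    model = {
        "lemon": {"auditory": 3.2, "gustatory": 4.9, "visual": 4.4},
        "bell": {"auditory": 5.0, "gustatory": 0.3, "visual": 3.4},
    }
    print(multisensory_overrating(human, model))  # 0.5
```

In the toy data the model rates "lemon" as strongly auditory, so it counts one more active modality than the human ratings do for that word, while "bell" matches; the mean difference across the two words is 0.5.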