Modeling Understanding of Story-Based Analogies Using Large Language Models

📅 2025-07-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically evaluates how well large language models (LLMs) align with human cognition on story-based analogical reasoning. Methodologically, it performs fine-grained, instance-level analysis rather than reporting aggregate accuracy alone: it quantifies semantic similarity via sentence embeddings, assesses discrimination between analogous targets and distractors, and elicits explicit analogical explanations through targeted prompting. Experiments span state-of-the-art architectures, including GPT-4 and LLaMA3, at two parameter scales (8B vs. 70B). The key contribution is the first demonstration that, while LLMs capture cross-story semantic similarity at the level of individual mappings, they exhibit markedly lower reasoning consistency and structural-mapping stability than humans. Crucially, this performance gap stems primarily from architectural limitations rather than insufficient parameter count. The results point to a fundamental constraint in current LLMs' analogical reasoning mechanisms: they lack the robust, structure-preserving inference capacities characteristic of human analogical cognition.

📝 Abstract
Recent advancements in Large Language Models (LLMs) have brought them closer to matching human cognition across a variety of tasks. How well do these models align with human performance in detecting and mapping analogies? Prior research has shown that LLMs can extract similarities from analogy problems but lack robust human-like reasoning. Building on Webb, Holyoak, and Lu (2023), the current study focused on a story-based analogical mapping task and conducted a fine-grained evaluation of LLM reasoning abilities compared to human performance. First, it explored the semantic representation of analogies in LLMs, using sentence embeddings to assess whether they capture the similarity between the source and target texts of an analogy, and the dissimilarity between the source and distractor texts. Second, it investigated the effectiveness of explicitly prompting LLMs to explain analogies. Throughout, we examine whether LLMs exhibit similar performance profiles to those observed in humans by evaluating their reasoning at the level of individual analogies, and not just at the level of overall accuracy (as prior studies have done). Our experiments include evaluating the impact of model size (8B vs. 70B parameters) and performance variation across state-of-the-art model architectures such as GPT-4 and LLaMA3. This work advances our understanding of the analogical reasoning abilities of LLMs and their potential as models of human reasoning.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' analogy detection and mapping compared to humans
Assessing semantic representation of analogies in LLMs using embeddings
Testing explicit prompting for analogy explanation in LLMs
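The embedding-based check in the second item above can be sketched with toy bag-of-words vectors standing in for real sentence embeddings. Everything below is illustrative: the study itself uses learned sentence embeddings, and the three stories are invented Duncker-style examples, not items from the paper's dataset.

```python
import math
from collections import Counter

def embed(text):
    # Toy stand-in for a sentence embedding: bag-of-words counts.
    # The study uses learned sentence embeddings instead.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def pick_analog(source, candidates):
    # Choose the candidate story most similar to the source;
    # a good representation should rank the true analog above distractors.
    src = embed(source)
    sims = [cosine(src, embed(c)) for c in candidates]
    return max(range(len(candidates)), key=sims.__getitem__), sims
```

Note that with such surface-level counts, a distractor that reuses the source's wording can outscore the structurally analogous story, which is exactly why the study probes whether LLM sentence embeddings separate true analogs from distractors rather than relying on word overlap.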
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluating analogy mapping using sentence embeddings
Prompting LLMs to explain analogies explicitly
Comparing model sizes and architectures systematically
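The explicit-prompting probe listed above can be approximated with a simple prompt builder. The wording below is a hypothetical illustration, not the authors' actual template, which is not reproduced here.

```python
def explanation_prompt(source_story: str, candidates: list[str]) -> str:
    """Build a hypothetical prompt asking an LLM to pick the analogous
    story and explain the element-by-element mapping."""
    lettered = "\n".join(
        f"({chr(ord('A') + i)}) {story}" for i, story in enumerate(candidates)
    )
    return (
        "Here is a source story:\n"
        f"{source_story}\n\n"
        "Which of the following stories is analogous to it?\n"
        f"{lettered}\n\n"
        "Answer with the letter of your choice, then explain the analogy "
        "by mapping each element of the source story to its counterpart."
    )
```

The returned string would be sent to a chat model such as GPT-4 or LLaMA3, and the elicited explanation could then be scored for mapping consistency at the level of individual analogies.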