🤖 AI Summary
This work systematically evaluates the practical capabilities of large language models (LLMs) on graph reasoning tasks—specifically graph description understanding, connectivity judgment, and shortest path finding—revealing a substantial gap between their theoretical expressivity and empirical performance. We introduce a controlled benchmark, structured input representations, and a human-annotated evaluation protocol to conduct rigorous empirical analysis on both synthetic graph reasoning tasks and real-world knowledge graphs. Our study uncovers a previously unreported structural failure mode: LLMs consistently fail to reconstruct accurate graph topologies from natural-language descriptions, exhibiting severe, asymmetric error patterns. Crucially, we bridge the theory–practice divide by quantifying these limitations and providing actionable insights for model improvement. All code, datasets, and evaluation protocols are publicly released, establishing a reproducible foundation and critical empirical baseline for developing graph-aware language models.
📝 Abstract
Large Language Models (LLMs) have achieved great success on various reasoning tasks. In this work, we focus on the graph reasoning ability of LLMs. Although theoretical studies have proved that LLMs are capable of handling graph reasoning tasks, empirical evaluations reveal numerous failures. To deepen our understanding of this discrepancy, we revisit the ability of LLMs on three fundamental graph tasks: graph description translation, graph connectivity, and the shortest-path problem. Our findings suggest that LLMs can fail to understand graph structures through text descriptions and exhibit varying performance across all three of these fundamental tasks. Meanwhile, we perform a real-world investigation on knowledge graphs and make observations consistent with our findings. The code and datasets are available.
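To make the three tasks concrete, here is a minimal sketch of the ground-truth computations an LLM's answers would be checked against. The `u-v` edge-list description format and the function names are illustrative assumptions, not the paper's actual prompt format or evaluation code.

```python
from collections import deque

def parse_edges(description):
    # Graph description translation: turn a textual edge list
    # (illustrative "A-B, B-C" format) into an adjacency map.
    adj = {}
    for token in description.split(","):
        u, v = token.strip().split("-")
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def shortest_path(adj, src, dst):
    # BFS covers the other two tasks: a returned path answers the
    # shortest-path question; None answers connectivity (not connected).
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in adj.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

adj = parse_edges("A-B, B-C, C-D, E-F")
print(shortest_path(adj, "A", "D"))  # ['A', 'B', 'C', 'D']
print(shortest_path(adj, "A", "F"))  # None (A and F are disconnected)
```

Evaluating an LLM then amounts to comparing its free-text answers against these exact computations on controlled synthetic graphs.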