🤖 AI Summary
This work addresses the lack of effective evaluation of open-ended reasoning by large language models (LLMs) over graph-structured data, particularly on tasks that require combining a node's features with its neighborhood context. To this end, we introduce GraphInfer-Bench, a benchmark that defines five graph-inference tasks, split between description and comparison, whose answers cannot be derived from a single node or a single path. A four-layer quality-control protocol screens the automatically generated data. Experiments on 42,000 samples from six real-world graphs show that no LLM-based method fully solves these tasks, while plain graph neural networks (GNNs) match or beat the strongest LLM-based method on every task. We compare four method families (graph-token alignment models, zero-shot frontier closed-source LLMs, Graph2Text supervised fine-tuning, and plain GNNs as a structural reference) and highlight the challenges that remain for graph-based reasoning.
📝 Abstract
Graph analysis underlies many applications whose answers cannot be looked up in a single record or retrieved along a path: laundering rings, drug repurposing, user preference, and scientific theme are all inferred from a node together with its neighbourhood. We introduce GraphInfer-Bench, a benchmark that tests whether LLMs can perform this kind of graph inference: producing an open-ended answer that no single node supports and no path retrieves. Existing graph-QA protocols cannot test this capability: algorithm simulation, node classification, single-node description, KG-QA, and GraphRAG all admit answers retrievable from one node or along a path. GraphInfer-Bench defines five tasks along two axes, Description (what a region is) and Comparison (how regions differ), each constructed so that the ground truth lives in no single node. The release contains 42,000 samples across six real-world graphs, generated automatically and screened by a four-layer quality-control protocol. We evaluate four method families on the same tasks: graph-token alignment models, zero-shot frontier closed-source LLMs, Graph2Text supervised fine-tuning (SFT), and plain GNNs as a structural reference. No method family closes the gap. Graph-token alignment partially handles description tasks (relational, theme) but collapses on comparison tasks. Among LLM-based methods, frontier LLMs lead on outlier detection and community partition but lag on masked-node prediction. Graph2Text SFT is the strongest LLM-based method on the description side yet falls behind frontier LLMs on comparison. On every task, plain GNNs match or beat the strongest LLM-based method, with the largest margin on community detection. GraphInfer-Bench surfaces graph inference as an open capability gap rather than a property of any one architecture.