GraCoRe: Benchmarking Graph Comprehension and Complex Reasoning in Large Language Models

📅 2024-07-03

🏛️ International Conference on Computational Linguistics

📈 Citations: 2

✨ Influential: 0

🤖 AI Summary

Existing graph understanding benchmarks are limited to homogeneous graphs and define capabilities ambiguously, which hinders systematic evaluation of the graph reasoning abilities of large language models (LLMs). This paper introduces GraCoRe, the first comprehensive benchmark covering both homogeneous and heterogeneous graphs. It establishes a three-tier capability taxonomy that divides capabilities into 10 areas, tested through 19 fine-grained tasks on 11 datasets containing 5,140 graphs. Methodologically, GraCoRe constructs a multi-source graph dataset, designs capability-oriented tasks with granular metrics, and evaluates 12 LLMs (four closed-source and eight open-source), including OpenAI o1. Key contributions include: (1) the first unified evaluation protocol for both graph types; (2) empirical evidence that semantic enrichment and node ordering affect performance; and (3) the finding that the ability to process longer texts does not necessarily improve graph comprehension or reasoning. The benchmark is open-sourced.

📝 Abstract

Evaluating the graph comprehension and reasoning abilities of Large Language Models (LLMs) is challenging and often incomplete. Existing benchmarks focus primarily on pure graph understanding, lacking a comprehensive evaluation across all graph types and detailed capability definitions. This paper presents GraCoRe, a benchmark for systematically assessing LLMs' graph comprehension and reasoning. GraCoRe uses a three-tier hierarchical taxonomy to categorize and test models on pure and heterogeneous graphs, subdividing capabilities into 10 distinct areas tested through 19 tasks. Our benchmark includes 11 datasets with 5,140 graphs of varying complexity. We evaluate four closed-source and eight open-source LLMs, conducting thorough analyses from both ability and task perspectives. Key findings reveal that the OpenAI o1 model has amazing comprehension and reasoning capabilities, semantic enrichment enhances reasoning performance, node ordering impacts task success, and the ability to process longer texts does not necessarily improve graph comprehension or reasoning. GraCoRe is open-sourced at https://github.com/ZIKEYUAN/GraCoRe
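
The finding that node ordering impacts task success is easier to picture with a concrete serialization. The following is a minimal sketch, not GraCoRe's actual prompt format: the function `graph_to_text`, the `Nodes:` header, and the `u -- v` edge notation are all assumptions made for illustration. It shows that one and the same graph produces different prompt text depending on the order in which nodes are listed.

```python
def graph_to_text(edges, node_order):
    """Serialize an undirected graph as plain text, following node_order.

    Hypothetical format for illustration only; not taken from GraCoRe.
    """
    # Position of each node in the chosen ordering.
    rank = {node: i for i, node in enumerate(node_order)}
    lines = ["Nodes: " + ", ".join(str(n) for n in node_order)]
    # Write each edge with its earlier-ranked endpoint first,
    # then list the edges in rank order.
    oriented = [tuple(sorted(edge, key=rank.__getitem__)) for edge in edges]
    oriented.sort(key=lambda e: (rank[e[0]], rank[e[1]]))
    lines.extend(f"{u} -- {v}" for u, v in oriented)
    return "\n".join(lines)


# A 4-cycle, described once with ascending and once with descending node order.
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
prompt_sorted = graph_to_text(edges, [0, 1, 2, 3])
prompt_reversed = graph_to_text(edges, [3, 2, 1, 0])

print(prompt_sorted)
print()
print(prompt_reversed)
```

Both strings describe the same edge set, yet the token sequence a model reads differs, which is one plausible way an ordering effect could arise.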

Problem

Research questions and friction points this paper is trying to address.

Evaluating graph comprehension in LLMs

Assessing complex reasoning abilities

Benchmarking across diverse graph types

Innovation

Methods, ideas, or system contributions that make the work stand out.

Three-tier hierarchical taxonomy of graph capabilities

Evaluates 10 distinct capability areas through 19 tasks

Includes 11 datasets with 5,140 graphs

🔎 Similar Papers

How Do Large Language Models Understand Graph Patterns? A Benchmark for Graph Pattern Comprehension