🤖 AI Summary
Large language models (LLMs) lack rigorous evaluation on real-world graph computation tasks, exhibiting both performance gaps and frequent hallucination. Method: We introduce GraphArena, the first benchmark tailored to realistic graph computation, covering four polynomial-time (P-class) tasks and six NP-complete problems. Our graph-specific evaluation framework applies dual criteria (feasibility and optimality) and enables fine-grained response classification (correct, suboptimal, hallucinated, or missing). Test cases are generated from formal algorithm specifications, and verification combines rule-based checkers with multi-granularity classification. We systematically evaluate more than ten mainstream LLMs. Contribution/Results: We reveal, for the first time, a sharp accuracy drop and a >40% hallucination rate on large-scale graph problems. Empirically, code generation is the most effective way to improve solution feasibility, while scaling test-time compute significantly boosts the rate of optimal solutions. Our open-source tools have been widely adopted by the community.
📝 Abstract
The "arms race" of Large Language Models (LLMs) demands new benchmarks to examine their progress. In this paper, we introduce GraphArena, a benchmarking tool designed to evaluate LLMs on real-world graph computation problems. It offers a suite of four polynomial-time tasks (e.g., Shortest Distance) and six NP-complete challenges (e.g., the Traveling Salesman Problem). GraphArena features a rigorous evaluation framework that classifies LLM outputs as correct, suboptimal (feasible but not optimal), hallucinatory (properly formatted but infeasible), or missing. Evaluation of more than 10 LLMs reveals that even top-performing models struggle with larger, more complex graph problems and exhibit hallucination issues. We further explore four potential remedies for improving LLMs on graph computation: chain-of-thought prompting, instruction tuning, code writing, and scaling test-time compute, each demonstrating distinct strengths and limitations. GraphArena complements existing LLM benchmarks and is open-sourced at https://github.com/squareRoot3/GraphArena.
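The four-way classification described above (correct, suboptimal, hallucinatory, missing) can be sketched with two checks per task: feasibility and optimality. The sketch below is illustrative only; the function names (`classify_response`, `feasible`, `optimal`) and the toy Shortest Distance instance are assumptions, not GraphArena's actual API.

```python
# Hypothetical sketch of dual-criteria (feasibility + optimality) response
# classification, as described in the abstract. Names are illustrative.

def classify_response(answer, is_feasible, is_optimal):
    """Return one of: 'correct', 'suboptimal', 'hallucinated', 'missing'."""
    if answer is None:              # no properly formatted solution extracted
        return "missing"
    if not is_feasible(answer):     # well-formatted but violates constraints
        return "hallucinated"
    return "correct" if is_optimal(answer) else "suboptimal"


# Toy Shortest Distance instance: weighted edges; the optimal 0 -> 3 cost is 4.
graph = {(0, 1): 1, (1, 3): 3, (0, 2): 2, (2, 3): 5}

def path_cost(path):
    edges = list(zip(path, path[1:]))
    if any(e not in graph for e in edges):
        return None                 # path uses a non-existent edge
    return sum(graph[e] for e in edges)

def feasible(path):
    return path[0] == 0 and path[-1] == 3 and path_cost(path) is not None

def optimal(path):
    return path_cost(path) == 4

print(classify_response([0, 1, 3], feasible, optimal))  # -> correct (cost 4)
print(classify_response([0, 2, 3], feasible, optimal))  # -> suboptimal (cost 7)
print(classify_response([0, 3], feasible, optimal))     # -> hallucinated (no such edge)
print(classify_response(None, feasible, optimal))       # -> missing
```

This separation of concerns is what lets the benchmark distinguish a model that returns a valid but inefficient route from one that invents edges that do not exist in the graph.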