🤖 AI Summary
Large language models (LLMs) still struggle with the semantics of SPARQL SELECT queries, a core technology for accessing knowledge graphs (KGs). Method: The authors implement new benchmarking tasks in the LLM-KG-Bench framework for automated execution and evaluation, assessing LLMs along four dimensions: syntax, semantic read (query interpretation), semantic create (query generation), and the role of including the knowledge graph in the prompt. Using these tasks, they evaluate a selection of GPT, Gemini, and Claude models. Contribution/Results: While the best current LLMs reliably fix basic syntax errors, accuracy drops markedly when semantically correct queries must be created, and performance varies strongly with task complexity and the specific model. The work provides a systematic, reproducible, quantitative characterization of out-of-the-box LLM capabilities at the SPARQL semantic level, supporting further research on KG-LLM integration.
📝 Abstract
The integration of Large Language Models (LLMs) with Knowledge Graphs (KGs) offers significant synergistic potential for knowledge-driven applications. One possible integration point is the interpretation and generation of formal languages, such as those used in the Semantic Web, where SPARQL is a core technology for accessing KGs. In this paper, we take a quantitative approach to measuring the out-of-the-box capabilities of LLMs to work with SPARQL, and more specifically with SPARQL SELECT queries. We implemented various benchmarking tasks in the LLM-KG-Bench framework for automated execution and evaluation with several LLMs. The tasks assess capabilities along the dimensions of syntax, semantic read, semantic create, and the role of including the knowledge graph in the prompt. With these new benchmarking tasks, we evaluated a selection of GPT, Gemini, and Claude models. Our findings indicate that working with SPARQL SELECT queries is still challenging for LLMs and depends heavily on the specific LLM as well as the complexity of the task. While fixing basic syntax errors poses no problem for the best of the current LLMs evaluated, creating semantically correct SPARQL SELECT queries remains difficult in several cases.
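To make the object of study concrete, the following is a minimal sketch of the kind of SPARQL SELECT query the benchmark tasks revolve around. The prefix, class, and predicate names here are hypothetical illustrations, not taken from the benchmark itself:

```sparql
# Illustrative SPARQL SELECT query: retrieve people and the cities they live in
# from a knowledge graph (ex: is a hypothetical example namespace).
PREFIX ex: <http://example.org/>

SELECT ?person ?city
WHERE {
  ?person a ex:Person ;          # ?person is typed as an ex:Person
          ex:livesIn ?city .     # and linked to a city via ex:livesIn
}
LIMIT 10
```

A syntax-oriented task would, for example, ask an LLM to repair a query like this with a missing brace or misspelled keyword, while a semantic-create task would ask it to author such a query from a natural-language question and (optionally) the knowledge graph included in the prompt.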