🤖 AI Summary
This study addresses the critical question of which large language models (LLMs) best support knowledge graph (KG) and semantic web technologies—specifically in RDF/SPARQL understanding, Turtle/JSON-LD serialization, and related tasks. Method: We introduce the first automated benchmark tailored to KG semantic capabilities, featuring a scalable Semantic Technology Evaluation API that uniformly assesses over 30 open- and closed-source LLMs. Our framework integrates vLLM for efficient inference, RDF/SPARQL parsers, and serialization validators to enable multi-dimensional automatic scoring and consistency verification across six core tasks: RDF modeling, SPARQL query generation, serialization conversion, ontology alignment, triple extraction, and schema reasoning. Contribution/Results: We release the largest publicly available LLM-KG evaluation dataset to date and human-validated capability cards. Empirical results reveal substantial capability gaps among state-of-the-art LLMs in semantic web technologies, underscoring the need for targeted architectural and training improvements.
📝 Abstract
Current Large Language Models (LLMs) can assist with developing program code, among many other things, but can they also support working with Knowledge Graphs (KGs)? Which LLM offers the best capabilities in the field of the Semantic Web and Knowledge Graph Engineering (KGE)? Can this be determined without manually checking many answers? The LLM-KG-Bench framework in Version 3.0 is designed to answer these questions. It consists of an extensible set of tasks for the automated evaluation of LLM answers and covers different aspects of working with semantic technologies. In this paper, the LLM-KG-Bench framework is presented in Version 3, along with a dataset of prompts, answers, and evaluations generated with it and several state-of-the-art LLMs. Significant enhancements have been made to the framework since its initial release, including an updated task API that offers greater flexibility in handling evaluation tasks, revised tasks, and extended support for various open models through the vLLM library, among other improvements. A comprehensive dataset has been generated using more than 30 contemporary open and proprietary LLMs, enabling the creation of exemplary model cards that demonstrate the models' capabilities in working with RDF and SPARQL, as well as comparing their performance on Turtle and JSON-LD RDF serialization tasks.