🤖 AI Summary
This work addresses how large language models (LLMs) internalize knowledge graph (KG) token sequences during pretraining and generalize them into reusable knowledge. To this end, we introduce the concept of *protoknowledge*, formally characterizing KG internalization as three structured forms: lexical, hierarchical, and topological knowledge. We propose Knowledge Activation Tasks (KATs) as a quantitative evaluation framework and establish a novel semantic-level data contamination analysis paradigm. Through controlled experiments comparing KG embedding, sequence modeling, semantic alignment, and prompting strategies, we demonstrate that protoknowledge significantly influences Text-to-SPARQL performance, with its semantic bias strongly correlating with generalization capability. Our findings provide an interpretable, measurable empirical foundation for understanding how LLMs represent and leverage structured knowledge, bridging the gap between KG semantics and LLM pretraining dynamics.
📝 Abstract
We introduce the concept of protoknowledge to formalize and measure how sequences of tokens encoding Knowledge Graphs (KGs) are internalized during pretraining and utilized at inference time by Large Language Models (LLMs). LLMs have demonstrated the ability to memorize vast amounts of token sequences during pretraining, and a central open question is how they leverage this memorization as reusable knowledge through generalization. We categorize protoknowledge into lexical, hierarchical, and topological forms, according to the type of knowledge that needs to be activated. We measure protoknowledge through Knowledge Activation Tasks (KATs) and analyze its general properties, such as semantic bias. We then investigate the impact of protoknowledge on Text-to-SPARQL performance by varying the prompting strategy depending on input conditions. To this end, we adopt a novel analysis framework that assesses whether model predictions align with the successful activation of the relevant protoknowledge for each query. This methodology provides a practical tool for exploring Semantic-Level Data Contamination and serves as an effective strategy for Closed-Pretraining models.
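To make the three protoknowledge forms concrete, here is a minimal sketch of what cloze-style Knowledge Activation Task probes over a toy KG triple might look like. The prompt templates, function names, and the example DBpedia-style triple are illustrative assumptions, not the paper's exact protocol.

```python
# Hypothetical KAT-style probes built from a toy knowledge-graph triple.
# Each probe targets one form of protoknowledge: lexical (identifier-to-label),
# hierarchical (instance-to-class), topological (edge completion).

TRIPLE = ("dbr:Berlin", "dbo:country", "dbr:Germany")  # illustrative triple

def lexical_kat(entity_id: str) -> str:
    # Lexical protoknowledge: can the model map a KG identifier to its label?
    return f"The natural-language label of the entity {entity_id} is ____."

def hierarchical_kat(entity_id: str) -> str:
    # Hierarchical protoknowledge: can the model recall the entity's class?
    return f"In the ontology, {entity_id} is an instance of the class ____."

def topological_kat(subject: str, predicate: str) -> str:
    # Topological protoknowledge: can the model complete a KG edge?
    return f"Complete the triple: {subject} {predicate} ____."

prompts = [
    lexical_kat(TRIPLE[0]),
    hierarchical_kat(TRIPLE[0]),
    topological_kat(TRIPLE[0], TRIPLE[1]),
]
for p in prompts:
    print(p)
```

In a setup like this, a model's accuracy on each probe family would serve as a per-form activation score, which could then be compared against its Text-to-SPARQL performance on queries that depend on the same identifiers.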