🤖 AI Summary
This work addresses how large language models (LLMs) internalize knowledge graph (KG) token sequences during pretraining and generalize them into reusable knowledge. To this end, we introduce the concept of *protoknowledge*, formally characterizing KG internalization as three structured forms: lexical, hierarchical, and topological knowledge. We propose Knowledge Activation Tasks (KATs) as a quantitative evaluation framework and establish a novel semantic-level data contamination analysis paradigm. Through controlled experiments comparing KG embedding, sequence modeling, semantic alignment, and prompting strategies, we demonstrate that protoknowledge significantly influences Text-to-SPARQL performance, with its semantic bias strongly correlating with generalization capability. Our findings provide an interpretable, measurable empirical foundation for understanding how LLMs represent and leverage structured knowledge, bridging the gap between KG semantics and LLM pretraining dynamics.
📝 Abstract
We introduce the concept of protoknowledge to formalize and measure how sequences of tokens encoding Knowledge Graphs (KGs) are internalized during pretraining and utilized at inference time by Large Language Models (LLMs). LLMs have demonstrated the ability to memorize vast amounts of token sequences during pretraining, and a central open question is how they leverage this memorization as reusable knowledge through generalization. We categorize protoknowledge into lexical, hierarchical, and topological forms, according to the type of knowledge that needs to be activated. We measure protoknowledge through Knowledge Activation Tasks (KATs) and analyze its general properties, such as semantic bias. We then investigate the impact of protoknowledge on Text-to-SPARQL performance by varying the prompting strategy depending on input conditions. To this end, we adopt a novel analysis framework that assesses whether model predictions align with the successful activation of the relevant protoknowledge for each query. This methodology provides a practical tool for exploring Semantic-Level Data Contamination and serves as an effective strategy for Closed-Pretraining models.
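To make the three protoknowledge forms concrete, here is a minimal sketch of what cloze-style Knowledge Activation Task probes over a toy KG triple might look like. The prompt templates, function names, and the example DBpedia-style triple are illustrative assumptions, not the paper's exact protocol.

```python
# Hypothetical KAT-style probes built from a toy knowledge-graph triple.
# Each probe targets one form of protoknowledge: lexical (identifier-to-label),
# hierarchical (instance-to-class), topological (edge completion).

TRIPLE = ("dbr:Berlin", "dbo:country", "dbr:Germany")  # illustrative triple

def lexical_kat(entity_id: str) -> str:
    # Lexical protoknowledge: can the model map a KG identifier to its label?
    return f"The natural-language label of the entity {entity_id} is ____."

def hierarchical_kat(entity_id: str) -> str:
    # Hierarchical protoknowledge: can the model recall the entity's class?
    return f"In the ontology, {entity_id} is an instance of the class ____."

def topological_kat(subject: str, predicate: str) -> str:
    # Topological protoknowledge: can the model complete a KG edge?
    return f"Complete the triple: {subject} {predicate} ____."

prompts = [
    lexical_kat(TRIPLE[0]),
    hierarchical_kat(TRIPLE[0]),
    topological_kat(TRIPLE[0], TRIPLE[1]),
]
for p in prompts:
    print(p)
```

In a setup like this, a model's accuracy on each probe family would serve as a per-form activation score, which could then be compared against its Text-to-SPARQL performance on queries that depend on the same identifiers.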