Large Language Model Sourcing: A Survey

📅 2025-10-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of tracing provenance and ensuring credibility of large language model (LLM)–generated content, this paper proposes the first holistic four-dimensional provenance framework integrating both model- and data-centric perspectives: model origin identification, architectural and mechanistic analysis, training data attribution, and external information verification. We introduce a novel “prior–posterior” dual-paradigm classification system and unify techniques including model fingerprinting, response-level verification, and traceability-aware embedding to support both proactive and reactive reasoning. The framework systematically consolidates fragmented provenance research efforts, significantly enhancing the explainability, verifiability, and transparency of AI-generated content. It establishes a theoretical foundation and scalable technical methodology for detecting AI-generated content (AIGC), identifying model identities, and ensuring information reliability.

📝 Abstract
The rapid advancement of large language models (LLMs) has revolutionized artificial intelligence, shifting from supporting objective tasks (e.g., recognition) to empowering subjective decision-making (e.g., planning, decision-making). This marks the dawn of general and powerful AI, with applications spanning a wide range of fields, including programming, education, healthcare, finance, and law. However, their deployment introduces multifaceted risks. Due to the black-box nature of LLMs and the human-like quality of their generated content, issues such as hallucinations, bias, unfairness, and copyright infringement become particularly significant. In this context, sourcing information from multiple perspectives is essential. This survey presents a systematic investigation into provenance tracking for content generated by LLMs, organized around four interrelated dimensions that together capture both model- and data-centric perspectives. From the model perspective, Model Sourcing treats the model as a whole, aiming to distinguish content generated by specific LLMs from content authored by humans. Model Structure Sourcing delves into the internal generative mechanisms, analyzing the architectural components that shape a model's outputs. From the data perspective, Training Data Sourcing focuses on internal attribution, tracing the origins of generated content back to the model's training data. In contrast, External Data Sourcing emphasizes external validation, identifying the external information used to support or influence a model's responses. We also propose a dual-paradigm taxonomy that classifies existing sourcing methods into prior-based (proactive traceability embedding) and posterior-based (retrospective inference) approaches. Traceability across these dimensions enhances the transparency, accountability, and trustworthiness of LLM deployment in real-world applications.
Problem

Research questions and friction points this paper is trying to address.

Tracking provenance of LLM-generated content to enhance transparency
Addressing hallucinations and bias through multi-perspective sourcing
Developing traceability methods for model structure and training data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Provenance tracking for LLM-generated content
Model- and data-centric sourcing perspectives
Prior-based and posterior-based traceability methods
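The prior-based vs. posterior-based split can be illustrated with a toy green-list watermark, a well-known instance of proactive traceability embedding: at generation time the model is nudged toward a key-derived "green" subset of the vocabulary (prior-based), and a verifier later counts how many tokens fall in their green lists (posterior-based). This is a minimal, hypothetical sketch, not the survey's own method; `VOCAB`, the key, and the candidate lists are invented for illustration.

```python
import hashlib

# Hypothetical toy vocabulary standing in for a real tokenizer's vocabulary.
VOCAB = [f"tok{i}" for i in range(100)]

def green_list(prev_token: str, key: str = "secret") -> set:
    """Hash each vocabulary token with the previous token and a private key
    to select a pseudorandom 'green' subset (roughly half the vocabulary)."""
    greens = set()
    for t in VOCAB:
        digest = hashlib.sha256(f"{key}|{prev_token}|{t}".encode()).digest()
        if digest[0] % 2 == 0:
            greens.add(t)
    return greens

def generate(start: str, candidates_per_step: list, key: str = "secret") -> list:
    """Prior-based embedding: at each step, prefer a candidate token that lies
    in the current green list, leaving a detectable statistical signal."""
    out, prev = [], start
    for cands in candidates_per_step:
        greens = green_list(prev, key)
        pick = next((t for t in cands if t in greens), cands[0])
        out.append(pick)
        prev = pick
    return out

def green_fraction(start: str, tokens: list, key: str = "secret") -> float:
    """Posterior-based detection: retrospectively measure the fraction of
    tokens that fall in their green lists (unmarked text sits near 0.5)."""
    hits, prev = 0, start
    for t in tokens:
        hits += t in green_list(prev, key)
        prev = t
    return hits / len(tokens)
```

Watermarked output scores well above the ~0.5 chance baseline, which is the intuition behind the response-level verification techniques the summary mentions; real schemes bias the sampling distribution rather than hard-filtering candidates.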
Liang Pang
Associate Professor, Institute of Computing Technology, Chinese Academy of Sciences
Large Language Model, Semantic Matching, Question Answering, Text Matching, Text Generation
Kangxi Wu
State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences and University of Chinese Academy of Sciences, Beijing, China
Sunhao Dai
Renmin University of China
Recommender Systems, Information Retrieval, Trustworthy, Large Language Models
Zihao Wei
State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences and University of Chinese Academy of Sciences, Beijing, China
Zenghao Duan
CAS Key Laboratory of AI Safety, Institute of Computing Technology, CAS
Large Language Model
Jia Gu
State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences and University of Chinese Academy of Sciences, Beijing, China
Xiang Li
State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences and University of Chinese Academy of Sciences, Beijing, China
Zhiyi Yin
State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Jun Xu
Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China
Huawei Shen
State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Xueqi Cheng
Ph.D. student, Florida State University
Data Mining, LLM, GNN, Computational Social Science