The Impact of Copyrighted Material on Large Language Models: A Norwegian Perspective

📅 2024-12-12
🏛️ arXiv.org
📈 Citations: 0 (Influential: 0)
🤖 AI Summary
This study investigates how training large language models (LLMs) on copyrighted material affects model performance, focusing on Norwegian-language models and on the legal and ethical questions such training raises. We develop a reproducible evaluation framework that systematically quantifies the impact of different categories of copyright-protected text (books, newspapers, and fiction). Our methodology integrates multi-task benchmarking, controlled-variable training experiments, copyright-text provenance tracing, and contribution attribution analysis. Results indicate that books and newspapers contribute positively across a diverse set of Norwegian benchmarks, whereas fiction may decrease performance. The resulting data-impact assessment considers legal compliance and modeling efficacy together, offering empirical grounding for copyright-compliant training-data auditing and for compensation schemes for authors whose works contribute to AI development.
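The controlled-variable comparison described above can be illustrated with a minimal sketch: score a baseline model and a variant trained with one extra data category on the same benchmarks, then look at the per-benchmark differences. Everything here is an assumption made for illustration. The function names, the benchmark names (`task_a`, `task_b`, `task_c`), and the scores are hypothetical placeholders, not the paper's code, benchmarks, or results.

```python
from statistics import mean


def score_deltas(baseline, variant):
    """Return variant minus baseline for every benchmark present in both runs."""
    shared = sorted(baseline.keys() & variant.keys())
    if not shared:
        raise ValueError("the two runs share no benchmarks")
    return {name: variant[name] - baseline[name] for name in shared}


def summarize(deltas):
    """Split benchmarks into improved and degraded and report the mean delta."""
    return {
        "mean_delta": mean(deltas.values()),
        "improved": [name for name, d in deltas.items() if d > 0],
        "degraded": [name for name, d in deltas.items() if d < 0],
    }


# Hypothetical placeholder scores in percent (NOT results from the paper).
baseline = {"task_a": 50.0, "task_b": 40.0, "task_c": 70.0}
with_extra_category = {"task_a": 55.0, "task_b": 45.0, "task_c": 65.0}

deltas = score_deltas(baseline, with_extra_category)
summary = summarize(deltas)
print(deltas)
print(summary)
```

A per-benchmark view is used rather than a single average because the abstract reports mixed effects: a data category can help on some tasks and hurt on others, and an average alone would hide that.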

📝 Abstract
The use of copyrighted materials in training generative language models raises critical legal and ethical questions. This paper presents a framework for and the results of empirically assessing the impact of copyrighted materials on the performance of large language models (LLMs) for Norwegian. We found that both books and newspapers contribute positively when the models are evaluated on a diverse set of Norwegian benchmarks, while fiction works possibly lead to decreased performance. Our experiments could inform the creation of a compensation scheme for authors whose works contribute to AI development.
Problem

Research questions and friction points this paper is trying to address.

Copyrighted Materials
Legal Issues
Ethical Concerns
Innovation

Methods, ideas, or system contributions that make the work stand out.

Copyrighted Materials
Norwegian Language Model
Impact on AI Training
Authors
Javier de la Rosa
National Library of Norway
Vladislav Mikhailov
University of Oslo
LLM, NLP, benchmarking
Lemei Zhang
Norwegian University of Science and Technology
Freddy Wetjen
National Library of Norway
David Samuel
Language Technology Group, University of Oslo
language modeling, semantic parsing, natural language processing
Peng Liu
Norwegian University of Science and Technology
Rolv-Arild Braaten
National Library of Norway
Petter Maehlum
University of Oslo
M. B. Birkenes
National Library of Norway
Andrey Kutuzov
University of Oslo
Computational Linguistics, Natural Language Processing, Diachronic Word Embeddings, Semantic Change Detection, Machine Learning
Tita Enstad
Researcher at the National Library of Norway
NLP, language technology, ML
S. A. Brygfjeld
National Library of Norway
J. Gulla
Norwegian University of Science and Technology
S. Oepen
University of Oslo
Erik Velldal
Professor at the University of Oslo, Dept. of Informatics, Language Technology Group
Natural Language Processing, Machine Learning
Wilfred Ostgulen
National Library of Norway
Liljia Ovrelid
University of Oslo
A. Myhre
National Library of Norway