The Impact of Copyrighted Material on Large Language Models: A Norwegian Perspective

📅 2024-12-12

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates the legal risks and performance impacts of training large language models (LLMs) on copyrighted material, focusing on Norwegian-language models. It develops a reproducible evaluation framework that systematically quantifies how different categories of copyright-protected text (books, newspapers, and fiction) affect model performance. The methodology integrates multi-task benchmarking, controlled-variable training experiments, copyright-text provenance tracing, and contribution attribution analysis. Results indicate that books and newspapers improve performance across a diverse set of Norwegian benchmarks, whereas fiction may degrade it. The study proposes a data-impact assessment paradigm that jointly considers legal compliance and modeling efficacy, offering empirical grounding for copyright-compliant training-data auditing and for compensation schemes for authors.

📝 Abstract

The use of copyrighted materials in training generative language models raises critical legal and ethical questions. This paper presents a framework for empirically assessing the impact of copyrighted materials on the performance of large language models (LLMs) for Norwegian, together with the results of that assessment. We found that both books and newspapers contribute positively when the models are evaluated on a diverse set of Norwegian benchmarks, while fiction works possibly lead to decreased performance. Our experiments could inform the creation of a compensation scheme for authors whose works contribute to AI development.
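The comparison described in the abstract can be pictured as a per-category score difference against a baseline model. The sketch below is only an illustration under stated assumptions: the function name `category_deltas`, the benchmark names, and every score are hypothetical placeholders, not the paper's framework or results. It assumes one baseline model trained without a given category and one model per category trained with it, all scored on the same benchmarks.

```python
from statistics import mean


def category_deltas(baseline, with_category):
    """Mean score change per data category, relative to the baseline.

    baseline:      {benchmark_name: score} for the model trained
                   without the category (hypothetical setup).
    with_category: {category: {benchmark_name: score}} for models
                   trained with that category added.
    """
    deltas = {}
    for category, scores in with_category.items():
        # Compare only benchmarks that both models were evaluated on.
        shared = baseline.keys() & scores.keys()
        if not shared:
            raise ValueError(f"no shared benchmarks for {category!r}")
        deltas[category] = mean(scores[b] - baseline[b] for b in shared)
    return deltas


# Toy numbers for illustration only; they are NOT taken from the paper.
baseline = {"task_a": 0.50, "task_b": 0.60}
with_category = {
    "books": {"task_a": 0.54, "task_b": 0.62},
    "newspapers": {"task_a": 0.52, "task_b": 0.64},
    "fiction": {"task_a": 0.49, "task_b": 0.59},
}

deltas = category_deltas(baseline, with_category)
for category, delta in sorted(deltas.items()):
    print(f"{category}: {delta:+.3f}")
```

A positive value means the category helped on average and a negative value means it hurt. A real study would also need repeated runs and significance testing before drawing conclusions, and any compensation scheme built on such numbers would require further design choices that this sketch does not address.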

Problem

Research questions and friction points this paper is trying to address.

Copyrighted Materials

Legal Issues

Ethical Concerns

Innovation

Methods, ideas, or system contributions that make the work stand out.

Copyrighted Materials

Norwegian Language Model

Impact on AI Training

🔎 Similar Papers

Inner-Probe: Discovering Copyright-related Data Generation in LLM Architecture