Copyright law and the lifecycle of machine learning models

Kretschmer, M., Margoni, T. and Oruc, P. (2024) Copyright law and the lifecycle of machine learning models. International Review of Intellectual Property and Competition Law, 55(1), pp. 110-138. (doi: 10.1007/s40319-023-01419-3)

Text
316749.pdf - Published Version
Available under License Creative Commons Attribution.

457kB

Abstract

Machine learning, a subfield of artificial intelligence (AI), relies on large corpora of data as input for learning algorithms, resulting in trained models that can perform a variety of tasks. While data or information are not subject matter within copyright law, almost all materials used to construct corpora for machine learning are protected by copyright law: texts, images, videos, and so on. There are global policy moves to address the copyright implications of machine learning, in particular in the context of so-called “foundation models” that underpin generative AI. This paper takes a step back, exploring empirically three technological settings through detailed case studies. We set out the established industry methodology of a lifecycle of AI (collecting data, organising data, model training, model operation) to arrive at descriptions suitable for legal analysis. This will allow an assessment of the challenges for a harmonisation of rights, exceptions and disclosure under EU copyright law. The three case studies are: 1. Machine learning for scientific purposes, in the context of a study of regional short-term letting markets; 2. Natural Language Processing (NLP), in the context of large language models; 3. Computer vision, in the context of content moderation of images. We find that the nature and quality of data corpora at the input stage is central to the lifecycle of machine learning. Because of the uncertain legal status of data collection and processing, combined with the competitive advantage gained by firms not disclosing technological advances, the inputs of the models deployed are often unknown. Moreover, the “lawful access” requirement of the EU exception for text and data mining may turn the exception into a decision by rightholders to allow machine learning in the context of their decision to allow access. We assess policy interventions at EU level, seeking to clarify the legal status of input data via copyright exceptions, opt-outs or the forced disclosure of copyright materials. We find that the likely result is a fully copyright-licensed environment of machine learning that may have problematic effects for the structure of industry, innovation and scientific research.

Item Type: Articles
Additional Information: The research was funded by the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 870626 (reCreating Europe: Rethinking digital copyright law for a culturally diverse, accessible, creative Europe). Case study 1 was developed with ESRC support for the Urban Big Data Centre (ES/L011921/1). Pinar Oruç prepared a first draft of the case studies as a postdoctoral researcher at CREATe, University of Glasgow.
Status: Published
Refereed: Yes
Glasgow Author(s) Enlighten ID: Margoni, Dr Thomas and Kretschmer, Professor Martin and Oruc, Dr Pinar
Authors: Kretschmer, M., Margoni, T., and Oruc, P.
College/School: College of Social Sciences > School of Law
Journal Name: International Review of Intellectual Property and Competition Law
Publisher: Springer
ISSN: 0018-9855
ISSN (Online): 2195-0237
Published Online: 01 February 2024
Copyright Holders: Copyright © 2024 The Authors
First Published: First published in International Review of Intellectual Property and Competition Law 55(1):110-138
Publisher Policy: Reproduced under a Creative Commons License


Project Code | Award No | Project Name | Principal Investigator | Funder's Name | Funder Ref | Lead Dept
190698 | | Urban Big Data Research Centre | Nick Bailey | Economic and Social Research Council (ESRC) | ES/L011921/1 | S&PS - Urban Big Data