By Michael Borella --
We are at the beginning of what promises to be a wave (potentially a tsunami) of complaints filed against the companies behind generative AI models (e.g., OpenAI). Recent lawsuits from Paul Tremblay and Mona Awad (Tremblay and Awad v. OpenAI Inc. et al. -- Northern District of California, No. 3:23-cv-03223), Sarah Silverman (Silverman v. OpenAI, Inc. -- Northern District of California, No. 3:23-cv-03416-AMO), and the Authors Guild (Authors Guild et al v. OpenAI Inc. et al. -- Southern District of New York, No. 1:23-cv-8292)[1] contend that OpenAI and others have hoovered up thousands of copyrighted publications, including those of the named plaintiffs, and used them to train large language models (LLMs) such as GPT-4. As these initial cases proceed, and possibly go up on appeal, they are likely to define the contours of how copyright law applies to the new world of generative AI and whether it is proper to train such models on copyrighted works without permission to do so.
The authors' theories of infringement vary, as do their ancillary claims. While acknowledging the risk of over-simplifying complex issues, we can boil the merits of these cases down to two main questions:
1) Is the ingestion of a copyrighted work into the training process of an LLM without the author's permission an infringement of the copyright?
2) What if an LLM trained in this fashion produces a new work that is substantially similar to the copyrighted work?
These questions can be thought of in terms of pigs and sausage.[2] Pigs can be turned into sausage, but it is generally accepted to be impossible to turn sausage back into a pig. Mathematicians would consider the transformation from pig to sausage to be an irreversible one-way function.
It is important to understand that all computer data are just organized collections of numbers. This includes digital copies of books, images, audio, video, web sites, etc. When a machine learning model such as an LLM is trained on a digital book, the arrangement of numbers representing the words, punctuation, front matter, and so on are transformed into a different arrangement of numbers -- weights in a complex set of neural networks.
In most cases, there is no one-to-one mapping between the numbers used before and after transformation. One cannot point to a particular set of numbers in an LLM and identify a Game of Thrones novel. Indeed, the weights in an LLM are a complex amalgam of most or all data on which it was trained. Even the entities that design and build LLMs have yet to provide an understanding of what the weights actually represent.
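The one-way character of this transformation can be illustrated with a toy example. A cryptographic hash is admittedly a far more extreme case than LLM training (a hash preserves nothing of the input's structure, while model weights do preserve statistical patterns), but it shows in miniature what "irreversible" means: the output is just numbers derived from the input, and nothing in those numbers lets you point back to the original words. This is an illustrative sketch only, not a claim about how any particular model works.

```python
import hashlib

# A cryptographic hash is a classic one-way function: easy to compute
# forward, computationally infeasible to reverse.
text = "Winter is coming."  # a short quote, standing in for a whole work
digest = hashlib.sha256(text.encode("utf-8")).hexdigest()

# The digest is a fixed-size number derived from the text. Inputs of any
# length map into the same fixed-size space, so there can be no general
# one-to-one mapping back to the source.
print(digest)
```

Note that even a one-character change to the input produces an unrelated digest; by analogy, one cannot inspect a particular region of an LLM's weights and say "this is where the novel is stored."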
So this leads to a likely answer to the first question. A similar set of facts was considered by the Second Circuit in Authors Guild, Inc. v. Google, Inc., in the context of using copyrighted books for search purposes. The Court ultimately ruled that the conversion of the copyrighted content into a form useful for searching was highly transformative, that displaying small portions of the books was fair use, and that such search and display did not provide a significant market substitute for the original works. Therefore, the mere use of a copyrighted work to train an LLM, even without permission, is unlikely to be a winning fact pattern for plaintiffs.
But the emergent magic of LLMs is that they might know enough about an ingested Game of Thrones novel to be able to produce its plot summary, a list of main characters, and even quote a section or two.[3] These uses might also fall under the Second Circuit's definition of fair use.
But an LLM may be able to produce significant portions of the work or the work as a whole.[4] Or, the LLM may be able to generate alternative endings to the novel, new works in the style of the author, or new works involving the same characters and relying on the authors' world-building.
Thus, the answer to the second question is not clear, though it seems that the LLM would have to provide "more than just a little" of the copyrighted work. For example, copyright famously protects actual works and not styles. This issue may boil down to whether an LLM can reverse the transformation function and turn sausage back into a reasonable semblance of a pig, as well as whether an LLM operator can successfully prevent it from doing so.
As noted, the cases currently being litigated may provide some clarity -- or, depending on how they proceed, maybe not. Also, Congress may step in and define new causes of action that specifically target LLMs and similar fact patterns.
Authors may ultimately have their strongest positions where they can argue that the operator of the LLM is unjustly enriching itself on the backs of the authors' labor or effectively competing in the same marketplace as the authors. At first blush it seems that imaging tools based on generative AI (e.g., DALL-E, Midjourney, and others), the use of which can eliminate the need for human illustrators, might be a better target for such claims.
[1] Here, the group of authors named in the complaint includes John Grisham, George R. R. Martin, Jodi Picoult, and Scott Turow.
[2] Vegans should feel free to replace "pigs" with "plant-based protein."
[3] OpenAI appears to be aware of the issues that this capability might raise. If you ask ChatGPT 4 to "provide a Jon Snow quote from Game of Thrones," it falls back on a Bing search to do so.
[4] This is theoretically possible, though OpenAI and others have put guardrails in place in attempts to prevent their models from such blatant infringement.
I'll reserve judgment about unjust enrichment and other non-copyright claims, but at least under copyright (and putting aside any incidental copying as part of the training process), when #1 is posed as the question of whether mere training on copyrighted works is infringing, the answer is very straightforward: No. It's no different than if, as an aspiring fantasy author, I started my writing preparations by studying up on every JK Rowling book. Would the mere act of my studying infringe JK Rowling's copyrights? The answer is clearly no. I don't see why that logic wouldn't apply to the LLM training context either.
Likewise, and continuing the JK Rowling hypo, the answer to #2 is also straightforwardly "Yes". If, after completing my preparations, I produce something that's substantially similar to a JK Rowling work, then of course that was an infringement.
Maybe there is something fundamentally unfair about how LLMs use authors' works on a vast scale, and again, I'll reserve comment on that, but not every fundamental unfairness that also happens to involve copyrighted works necessarily results in copyright infringement.
-kd
Posted by: kotodama | November 13, 2023 at 06:12 AM
Eliminating the need - if Fair Use is found at the front end - has a very different set of actors (it would be the user of the AI engine, not the builder of that engine).
Divide and Conquer may be an apt analogy.
Posted by: skeptical | November 13, 2023 at 09:51 AM