By Michael Borella --
On December 27, 2023, The New York Times Company ("The Times") sued several OpenAI entities and their stakeholder Microsoft ("OpenAI") in the Southern District of New York for copyright infringement, vicarious copyright infringement, contributory copyright infringement, violation of the Digital Millennium Copyright Act (DMCA), unfair competition, and trademark dilution (complaint). Unlike other high-profile copyright actions brought against OpenAI (e.g., by the Authors Guild, Julian Sancton et al., Michael Chabon et al., Sarah Silverman et al., Paul Tremblay and Mona Awad, et al.), The Times' allegations exhibit a remarkable degree of specificity. This will make it difficult for OpenAI to establish that (i) its generative AI models were not trained on copyrighted content of The Times, and (ii) OpenAI was engaging in fair use if and when it did so.
The complaint centers on OpenAI's large language model (LLM) chatbot, ChatGPT. As described by The Times:
An LLM works by predicting words that are likely to follow a given string of text based on the potentially billions of examples used to train it . . . . LLMs encode the information from the training corpus that they use to make these predictions as numbers called "parameters." There are approximately 1.76 trillion parameters in the GPT-4 LLM. The process of setting the values for an LLM's parameters is called "training." It involves storing encoded copies of the training works in computer memory, repeatedly passing them through the model with words masked out, and adjusting the parameters to minimize the difference between the masked-out words and the words that the model predicts to fill them in. After being trained on a general corpus, models may be further subject to "finetuning" by, for example, performing additional rounds of training using specific types of works to better mimic their content or style, or providing them with human feedback to reinforce desired or suppress undesired behaviors.
Once trained, LLMs may be provided with information specific to a use case or subject matter in order to "ground" their outputs. For example, an LLM may be asked to generate a text output based on specific external data, such as a document, provided as context. Using this method, Defendants' synthetic search applications: (1) receive an input, such as a question; (2) retrieve relevant documents related to the input prior to generating a response; (3) combine the original input with the retrieved documents in order to provide context; and (4) provide the combined data to an LLM, which generates a natural-language response.
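To make the four grounding steps quoted above concrete, here is a minimal, hypothetical sketch of that kind of retrieval-grounded pipeline. The documents, query, retriever, and the stand-in llm function are invented for illustration only and are not OpenAI's actual system or API.

```python
# Minimal, hypothetical sketch of the four-step grounding pipeline described above.
# The retriever, documents, and stand-in "llm" are invented for illustration only.
def retrieve(query, documents, top_k=2):
    """(2) Retrieve the documents that share the most words with the input."""
    q = set(query.lower().split())
    return sorted(documents, key=lambda d: len(q & set(d.lower().split())), reverse=True)[:top_k]

def grounded_answer(query, documents, llm):
    """(1) Receive an input; (3) combine it with retrieved documents; (4) pass to an LLM."""
    context = "\n".join(retrieve(query, documents))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return llm(prompt)

# Usage with a toy "LLM" that merely echoes the first retrieved line of its context.
docs = ["Article A: the city council voted to approve the budget.",
        "Article B: the local team won the championship game."]
print(grounded_answer("What did the city council vote on?", docs,
                      llm=lambda prompt: prompt.splitlines()[1]))
```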
Put another way, the parameters of an LLM like ChatGPT can be thought of as a compressed amalgam of its training data, represented in a way that preserves the wording, grammar, and semantic meaning of the original works. When queried, ChatGPT produces output consistent with this compressed representation.
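For readers who want to see the masked-word training idea in miniature, below is a toy sketch using PyTorch. It is an assumption-laden illustration, not GPT-4's architecture or OpenAI's training code: a tiny model's parameters are repeatedly adjusted so that it predicts each masked-out word from its context, which is how the parameters come to encode the training text.

```python
# Toy sketch of masked-word training, not GPT-4 or OpenAI's code. The corpus,
# model, and hyperparameters are hypothetical; the point is that the parameters
# are adjusted until they encode enough of the training text to fill in masks.
import torch
import torch.nn as nn

corpus = ["the court granted the motion", "the court denied the motion"]
vocab = {w: i for i, w in enumerate(sorted({w for s in corpus for w in s.split()} | {"[MASK]"}))}

class TinyMaskedLM(nn.Module):
    def __init__(self, vocab_size, dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)  # the model's "parameters"
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, token_ids):
        # Predict the masked word from the average of the context embeddings.
        return self.out(self.embed(token_ids).mean(dim=1))

model = TinyMaskedLM(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=0.05)

for _ in range(100):                          # repeatedly pass the works through the model
    for sentence in corpus:
        words = sentence.split()
        for i, target in enumerate(words):    # mask each word out in turn
            masked = words[:i] + ["[MASK]"] + words[i + 1:]
            ids = torch.tensor([[vocab[w] for w in masked]])
            loss = nn.functional.cross_entropy(model(ids), torch.tensor([vocab[target]]))
            opt.zero_grad()
            loss.backward()
            opt.step()                        # adjust parameters to shrink the prediction error
```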
Based on publicly available information, The Times alleges that a relatively large portion of the content used to train various versions of GPT came from its website, estimated to number in the millions of individual works. Further, and even more compellingly, The Times provides numerous samples of ChatGPT being able to generate near-verbatim copies of its articles. One such example is reproduced below:
This comparison is stunning. The Times alleges that it got ChatGPT to produce the output with "minimal prompting" but did not provide the specific prompt or series of prompts that it used to do so.[1] The output suggests that training data emphasized during the training process can be represented in a nearly uncompressed fashion in the resulting model. Thus, even if it is hard to point to exactly where the "copy" of an article resides amongst the 1.76 trillion parameters, the existence of such a copy should not be in question.
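To see why a heavily repeated passage can effectively be stored in a model's parameters and later regurgitated, consider a drastically simplified stand-in for an LLM: a next-word predictor whose "parameters" are just bigram counts over a short public-domain passage. This sketch is illustrative only, but it shows the mechanism: once the passage's statistics are encoded, prompting with its opening words reproduces it verbatim.

```python
# Drastically simplified stand-in for an LLM (not OpenAI's model): a next-word
# predictor whose "parameters" are bigram counts over one public-domain passage.
from collections import defaultdict

passage = ("congress shall make no law respecting an establishment of religion "
           "or prohibiting the free exercise thereof")
words = passage.split()

# "Training": record which word follows which, i.e., encode the passage's statistics.
next_word = defaultdict(lambda: defaultdict(int))
for a, b in zip(words, words[1:]):
    next_word[a][b] += 1

# "Prompting": start from the opening words and repeatedly emit the most likely next word.
output = words[:2]
while output[-1] in next_word and len(output) < len(words):
    candidates = next_word[output[-1]]
    output.append(max(candidates, key=candidates.get))

print(" ".join(output) == passage)  # True: the memorized passage is regurgitated verbatim
```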
OpenAI responded publicly to the complaint in a January 8, 2024 blog post, stating that:
Memorization is a rare failure of the learning process that we are continually making progress on, but it's more common when particular content appears more than once in training data, like if pieces of it appear on lots of different public websites. So we have measures in place to limit inadvertent memorization and prevent regurgitation in model outputs. We also expect our users to act responsibly; intentionally manipulating our models to regurgitate is not an appropriate use of our technology and is against our terms of use.
Interestingly, the regurgitations The New York Times induced appear to be from years-old articles that have proliferated on multiple third-party websites. It seems they intentionally manipulated prompts, often including lengthy excerpts of articles, in order to get our model to regurgitate. Even when using such prompts, our models don't typically behave the way The New York Times insinuates, which suggests they either instructed the model to regurgitate or cherry-picked their examples from many attempts.
This is a strange response. It essentially admits to copying The Times' articles in question, but makes the non-legal arguments of "Hey, it was just a bug," and "The Times had to work hard to manipulate our model." Like saying "the dog ate my homework," neither of these excuses is likely to hold up under scrutiny.
Why is OpenAI seemingly shooting itself in the foot regarding actual copying? Because it is putting all of its eggs in the fair use basket.
Fair use is an affirmative defense written into the copyright statute that allows limited use of copyrighted material without permission from the copyright holder. It recognizes that rigid copyright laws can stifle dissemination of knowledge. Therefore, it attempts to balance copyright holders' interests in their creative works with the public's interest in the advancement of knowledge and education. Thus, the fair use doctrine acknowledges that not all uses of copyrighted material harm the copyright owner and that some uses can be beneficial to society at large.
Even so, OpenAI has a long and uncertain road ahead of it. Fair use is a notoriously malleable four-factor test that can be applied inconsistently from court to court. Furthermore, the interpretive contours of the test have evolved since its first appearance in the statute almost 50 years ago. Even the U.S. Copyright Office admits that "[fair use] fact patterns and the legal application have evolved over time . . . ."[2]
Predicting the outcome of a fair use dispute is often a fool's errand, even for those well-versed in copyright law. For example, the Supreme Court recently found fair use in the copying of 11,500 lines of computer code but not in the artistic reproduction of a photograph.[3] The outcome of a case can ride on which fair use factors the judge or judges find to be most relevant to the facts of the case and how they interpret these factors.
Fair use might not be a legal sniff test, but it comes close. Nonetheless, let's take a look at each of the factors in order to understand the difficulties that OpenAI might run into when relying on this defense.
(1) The purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes.
Courts often view unlicensed copying for nonprofit educational or noncommercial purposes as more likely to be fair use than copying for commercial gain. In doing so, courts look to whether the use is transformative, in that it changes the original work in some manner, adding new expression or meaning, and does not merely replace the original use.
OpenAI runs a for-profit business and charges for end-user access to its models. Further, the examples provided by The Times are much closer to verbatim copying than any type of transformative use. Therefore, this factor weighs against OpenAI.
(2) The nature of the copyrighted work.
This factor examines how closely the use of the work aligns with copyright's goal of promoting creativity. So, using something that requires a lot of creativity, like a book, film, or piece of music, is less likely to support a fair use claim than using something fact-based, like a technical paper or a news report.
Here, OpenAI has an angle, as The Times produces a great deal of news reporting and cannot claim a copyright over basic facts. However, The Times' content includes many detailed articles explaining events and other facts in its writers' ostensibly creative voices. Moreover, investigative reporting is the uncovering and tying together of facts, which requires creative effort. At best, this factor is neutral for OpenAI.
(3) The amount and substantiality of the portion used in relation to the copyrighted work as a whole.
In considering this factor, courts examine how much and what part of the copyrighted work is used. If a significant portion is used, it is less likely to be seen as fair use. Using a smaller piece makes fair use more probable, though copying even a minute portion of a work might not qualify as fair use if it includes a critical or central part thereof.
This factor also weighs against OpenAI if we take as given The Times' allegations and evidence of almost-exact reproduction of its works.
(4) The effect of the use upon the potential market for or value of the copyrighted work.
This fourth factor may end up being the most important. The inquiry is whether the unauthorized use negatively affects the market for the copyright owner's original work. Courts look at whether the use decreases sales relating to the original work or has the potential to cause significant damage to its market if such use were to become common.
OpenAI will have a tough time establishing that it is not effectively free-riding off of The Times' investment in journalism, especially since GPT-4 is being integrated into its minority owner Microsoft's Bing search engine. Once this integration matures, Bing will generate answers to search queries, and might not even link back to websites (like that of The Times) from which it gleaned the underlying information used to formulate its answers. This could be a devastating blow to The Times' revenue, as the company relies on subscriptions that allow users unlimited access to paywalled articles going back decades, as well as on advertising to these users.
To reiterate, fair use analyses are unpredictable. Judges can place virtually all of their emphasis on as little as one factor. Still, it is hard to imagine a scenario in which OpenAI wins a fair use dispute if the facts cited in the complaint hold up. A more likely result is that The Times and OpenAI quietly settle before such a decision is made.
[1] Trying to get the current versions of ChatGPT to produce this or any article from The Times is quite difficult and may not be possible. This may be due to OpenAI recently putting in place guardrails that prevent the model from producing near-verbatim output.
[2] https://www.copyright.gov/fair-use/
[3] See Google LLC v. Oracle Am., Inc., 141 S. Ct. 1183 (2021) and Andy Warhol Found. for the Visual Arts, Inc. v. Goldsmith, 143 S. Ct. 1258 (2023).
The reason an LLM, if properly prompted, will provide a copy of some of the training data is that memorization can be part of training (see "Memorisation versus Generalisation in Pre-trained Language Models," Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Volume 1: Long Papers, pages 7564-7578, 2022).
Posted by: Orlando Lopez | February 20, 2024 at 08:54 AM
As to note 1, there is also the possibility that a series of prompts is required to arrive at the output. As such, there are also contractual implications for the Times in arriving at their 'evidence.'
The bottom line here is that the Times does NOT have a strong case on the actual merits.
This will be an easy call.
Posted by: skeptical | February 20, 2024 at 09:01 AM
Early on you criticize OpenAI for making "non-legal arguments" but then later, on the last fair use factor, you talk about "OpenAI will have a tough time establishing that it is not *effectively free-riding* off of The Times' investment in journalism" and "Bing will ... glean[] the *underlying information* used to formulate its answers" from e.g. the NYT website. (Asterisks are my poor man's attempt at emphases here.) I don't think those are really legal arguments either.
To be clear, I have zero canines in this confrontation, but color me a bit skeptical that NYT can somehow bootstrap some inadvertent, stray instances of verbatim copying against what is clearly its real target—the non-verbatim, deliberate training that incorporates the verbatim content.
Posted by: kotodama | February 20, 2024 at 12:01 PM
Having been filed in the Southern District of NY, the case is subject to Second Circuit precedent that might give OpenAI some comfort, including Authors Guild v. Google (2016).
Posted by: moondog | February 21, 2024 at 09:52 AM
Referencing fairuse.stanford.edu, my Co-pilot states in part:
Key Points: The court concluded that Google’s unauthorized digitizing of copyright-protected works, creation of a search functionality, and display of snippets from those works are non-infringing fair uses. The purpose of the copying is highly transformative, the public display of text is limited, and the revelations do not provide a significant market substitute for the protected aspects of the originals.
Anyone knowing anything about how the training is actually processed at the front end can easily point to the far more extensive transformation.
As I noted - this is an easy call.
Posted by: skeptical | February 21, 2024 at 03:13 PM
While I have seen some SAY that the NY Times has a pretty good legal position, I have yet to actually see any compelling analysis by way of the FACTS PERTINENT TO AI being presented.
In this manner, the recent Warhol case is decidedly NOT on point.
The Warhol case did not pivot AT ALL on technical processing in the view of transformation. Rather, the facts of that case pivoted almost exclusively on a different factor of commercial use.
The facts in AI are much more aligned with technical transformation, and even less extensive technical transformation in the Google cases EASILY saw Fair Use reached.
Again, I note that this is an easy call.
Posted by: skeptical | February 23, 2024 at 07:59 AM
Seen the news?
Allegations against NYT - curious as to whether, if they hold up, this case will make it to decision (or perhaps would be a directed decision).
Posted by: skeptical | February 27, 2024 at 01:51 PM