Copyright Clash: The New York Times’ Lawsuit Against Microsoft and OpenAI Reveals Complex Challenges in AI Training

The New York Times (“NYT”) recently filed a complaint against Microsoft and OpenAI, alleging copyright infringement in their use of the Times’ copyrighted materials to train ChatGPT.

First, the NYT alleges that the defendants copied substantial amounts of NYT content when building their large language models. OpenAI trains its models on internal corpora such as WebText and WebText2, along with external sources like Common Crawl, a dataset often described as a “copy of the Internet.” The defendants are accused of this copying without permission or payment. OpenAI asserts that its use of NYT’s works is entirely legal, constituting fair use because it serves a new, “transformative” purpose.

Generally, copyright provides the copyright owner with the exclusive rights to reproduce the work in copies, prepare derivative works, distribute copies, and publicly perform or display the work. Copyright also grants the owner the right to authorize others to exercise these rights. Using copyright-protected works as material for teaching or training, whether of humans or of artificial intelligence, arguably falls outside the scope of these exclusive rights. The Intellectual Property Clause grants Congress the power “to promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries.”

Professor Oren Bracha, speaking at the UNC School of Law JOLT Symposium, suggested that NYT’s lawsuit against OpenAI might not even require a fair use analysis, because it raises a subject-matter question under 17 U.S.C. § 102(b). The statute explicitly states that copyright protection does not extend to ideas, procedures, processes, systems, methods of operation, concepts, principles, or discoveries, regardless of the form in which they are described. On this view, OpenAI’s use of these works is merely part of the big-data training process, and the unprotectable information extracted in that process is not within the scope of copyright protection.


Additionally, the NYT claims that the defendants reproduce extensive portions of its articles verbatim in their outputs, accusing OpenAI of unauthorized distribution of copyrighted material. To support this, the NYT included screenshots in the complaint demonstrating that when a user asks ChatGPT for a specific paragraph of an article, the output is nearly identical to the original. When the user then asks for the next paragraph, ChatGPT provides it, and will continue doing so if asked. In theory, since ChatGPT’s training data contains entire books, a user could read a whole book by continuously requesting the “next paragraph” without paying a single penny.

I tested this method by trying to access George R. R. Martin’s A Song of Ice and Fire. When I asked for the first paragraph, ChatGPT gave an accurate response. However, asking for “the next paragraph” produced a response stating that it could not provide verbatim excerpts due to copyright restrictions. It remains unclear whether this restriction was added in response to the lawsuit, but at present, users cannot use ChatGPT as a free reading shortcut. Still, if the NYT’s claims are timely and valid, ChatGPT’s past ability to “reproduce the work in copies” may carry the case to trial. The likelihood of liability depends on the NYT’s ability to produce evidence of ChatGPT responding to user requests for such copyrighted works.

Finally, the NYT accuses the defendants of attributing false information to the NYT, damaging the Times’ credibility. Unlike earlier chatbots, which might respond with “I don’t know” when unable to answer a question, ChatGPT generates answers that “sound about right” but are sometimes entirely inconsistent with the facts. These inconsistencies can have real-world consequences: a personal injury lawyer faced discipline for using ChatGPT to draft legal documents submitted to court after the AI cited a nonexistent case. If the NYT can prove that ChatGPT’s false information caused “reliance” and “harm,” among other necessary elements, it may have grounds for a claim of fraudulent misrepresentation. Such a claim would sound in tort or contract rather than copyright, however, and is beyond the scope of this discussion.

In conclusion, the court may well dismiss the claim that using NYT’s content for AI training infringes copyright. But the other claims, concerning verbatim reproduction of the material and the attribution of false information to the Times, could survive. As the legal battle unfolds, its outcome may shape the future landscape for AI development, data usage, and the responsibilities of tech companies in respecting intellectual property rights.

Zhe Liu

Zhe Liu is a 2L at the University of North Carolina School of Law.