Overview
America’s frontier artificial intelligence (AI) labs have faced years of copyright litigation over their ingestion of books, news, and other expressive works. Yet, until this week, no US court had squarely applied the four statutory fair use factors to the merits of those claims. That changed when two judges of the US District Court for the Northern District of California issued detailed summary judgment opinions addressing how much latitude copyright law gives developers who copy entire books to train large language models (LLMs). Taken together, the two decisions trace an emerging framework: Training is likely fair use when the source texts are lawfully obtained and the record contains no concrete evidence of market substitution. Conversely, bad‑faith acquisition remains actionable even when the downstream use is “spectacularly” transformative.
In Depth
Kadrey v. Meta: A complete victory for Meta built on evidentiary gaps
Thirteen prominent authors, including Sarah Silverman, Ta‑Nehisi Coates, Rachel Louise Snyder, Junot Díaz, and Pulitzer Prize winner Andrew Sean Greer, sued Meta for copying their books from “shadow libraries” to train its Llama models. After discovery limited to the named plaintiffs’ works, the authors moved for partial summary judgment. Meta cross‑moved.
The court granted Meta’s motion, entering summary judgment for Meta on the copyright infringement claim. Applying the four § 107 fair use factors, the court reasoned as follows:
- Factor 1: Purpose and character. Using books to teach an LLM how to model linguistic relationships is “highly transformative,” because the model is deployed to draft emails, translate text, write code, and perform myriad tasks far removed from reading a book for entertainment or study. The commercial motive was acknowledged but did not override the degree of transformation.
- Factor 2: Nature of the work. The books are quintessentially creative, yet this factor “rarely controls” and carried little weight.
- Factor 3: Amount copied. Copying entire books was “reasonably necessary” to achieve the transformative purpose. Large, coherent blocks of high‑quality prose make LLMs better at handling long‑context prompts.
- Factor 4: Market effect. This factor was decisive. The plaintiffs offered no admissible evidence that Llama outputs substitute for, or dilute sales of, their books; tests showed Llama could not be made to reproduce more than 50 tokens from any of the plaintiffs’ texts, even under adversarial prompting. Their second theory, that Meta’s unpaid use destroys a hypothetical “training‑data licensing” market, was rejected as circular: copyright law does not guarantee owners the right to monetize every conceivable downstream use.
The order binds only the 13 plaintiffs and leaves intact a separate count alleging distribution liability for Meta’s torrenting of the shadow‑library files. The court explicitly cautioned that “this ruling does not stand for the proposition that Meta’s use of copyrighted materials to train its language models is lawful. It stands only for the proposition that these plaintiffs made the wrong arguments and failed to develop a record in support of the right one.”
Bartz v. Anthropic: A split decision that isolates piracy from transformative training
Three authors accused Anthropic of copying millions of books, some lawfully purchased and scanned and others pirated, to train its Claude models. With the court’s permission, Anthropic moved early for summary judgment on fair use. The court divided the challenged conduct into three distinct uses:
- Training copies (fair use). Feeding tokenized or compressed versions of books into model training is “spectacularly transformative.” Because the plaintiffs alleged no infringing outputs, the purpose, amount, and absence of market substitution compelled judgment for Anthropic.
- Print‑to‑digital conversions (fair use). Destroying lawfully purchased print copies and retaining searchable PDFs in an internal research library is a classic, non‑substitutive format shift. The court found no market harm and gave the creative nature of the works little weight.
- Pirated “forever” library (not fair use). In contrast, downloading roughly seven million e‑books from LibGen, Pirate Library Mirror, and other sites to build a permanent, general‑purpose corpus was held not to be fair use. The court stressed that the acquisition displaced ordinary sales and that “were the conduct to be condoned as a fair use . . . [a]s Anthropic itself suggested, ‘[t]hat would destroy the [entire] publishing market.’” Anthropic now faces a jury trial limited to damages for its pirated library copies.
Comparative implications
Read side by side, the opinions support three conclusions:
- LLM training is a transformative use. Both courts held that training LLMs on copyrighted books is highly transformative under the first factor.
- Lawful sourcing matters. Bartz shows that pirating source material can defeat a fair use defense even when the end use is transformative.
- The fourth fair use factor is the new battleground. Meta prevailed because the authors offered no data tying Llama outputs to lost book sales or licensing value. But plaintiffs who marshal surveys, sales data, or dilution studies could tip the scales next time.
What comes next?
Appeals are virtually certain. The US Court of Appeals for the Ninth Circuit will be asked to decide whether every invisible back‑end copy must clear fair use independently (Bartz) and whether dilution without direct substitution can defeat fair use (Kadrey). Until clearer appellate law arrives, companies should assume that training on text obtained from dubious sources remains a high‑risk activity and that a lack of market‑harm evidence may be only a temporary shield.