Shaping Policy to Protect Data Rights in AI Development

Introduction

Generative AI relies on three main components, often called the "AI triad": data, compute, and algorithms. All three matter, but data is the foundation; without it, models cannot be trained at all. So where does all this data come from? And do AI labs have the right to use it?

Most AI systems are trained on massive datasets scraped from the internet: vast amounts of text, images, and video, often collected without regard for copyright or consent. This raises serious legal and ethical concerns.

Recent high-profile lawsuits show the risks of overlooking these issues. Getty Images, for example, sued Stability AI for allegedly using more than 12 million photos without a license to train its AI image-generation system, Stable Diffusion. Getty claims that, beyond the unauthorised use itself, copyright metadata was stripped and that generated images bearing Getty's watermark put its reputation at risk. Artists and other creators have taken similar legal action, and major AI companies, including Meta, OpenAI, and Microsoft, have been sued for training large language models on copyrighted material without consent.

These cases signal a broader legal and governance shift: AI companies can no longer assume that online data is free for unrestricted use. In the absence of clear regulation, mass data scraping violates creators' rights, exposes companies to legal risk, and undermines trust in AI systems.

This article argues for immediate governance of the data used to train AI models. Such governance must ensure transparency, protect data rights, and support ethical AI development.

"Copyright law is a sword that's going to hang over the heads of AI companies for several years unless they figure out how to negotiate a solution."

Daniel Gervais, co-director of the intellectual property program at Vanderbilt University

How AI Models Are Trained

Generative AI training runs on massive amounts of data, gathered by web crawlers that copy billions of pages from across the internet. Articles, images, videos: everything is downloaded and fed into huge training datasets.

One widely used web crawler is CCBot, run by the non-profit Common Crawl, which compiles massive archives of the web. These archives are in turn used to build other training sets, such as LAION, which is widely used in image-model development. OpenAI has acknowledged that most of its training data comes from Common Crawl.

However, the fact that data is publicly available, or was collected by a non-profit, does not mean it is free to use without restriction. Such data is unstructured and rarely checked for ownership or permission.

In 2023, OpenAI launched its own crawler, GPTBot, to scrape online content for AI training. The company says GPTBot avoids paywalled and private sites and lets website owners opt out using the robots.txt file. But most creators are unaware this file exists, let alone how to use it.
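
For site owners who do want to opt out, the mechanism is a few lines of plain text served at the site root (e.g. example.com/robots.txt). This minimal example uses the crawler names that OpenAI and Common Crawl publish, and assumes the owner wants to block both crawlers from the entire site:

    # robots.txt, placed at the root of the website
    # Block OpenAI's training crawler from the whole site
    User-agent: GPTBot
    Disallow: /

    # Block Common Crawl's crawler as well
    User-agent: CCBot
    Disallow: /

Note that compliance is voluntary: robots.txt is a convention, not an enforcement mechanism, which is precisely the governance gap discussed below.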

Some datasets go further. Ongoing lawsuits against OpenAI and Meta allege that ChatGPT and LLaMA were also trained on illegally obtained datasets from shadow libraries such as Z-Library and Bibliotik. Meta's research has cited The Pile, a dataset created by EleutherAI that openly acknowledges including material from these shadow libraries.

The Atlantic built a search tool revealing that more than 7.5 million books are hosted on LibGen, a pirate library. Data from LibGen has reportedly been used by Meta and other AI companies to train their models.

Companies like Anthropic have defended these practices, arguing that licensing data at such vast scale is impossible and that their training data comes from publicly available and non-profit sources.

The legality of this entire process now depends on how courts interpret copyright and data rights in the AI era.

What Copyright Law Does (and Doesn't) Cover

In the U.S., copyright law is shaped by the fair use doctrine, which permits limited use of copyrighted material without the rights holder's permission for purposes such as criticism, comment, news reporting, teaching, scholarship, and research.

Whether a use qualifies as fair depends on four statutory factors:

  • the purpose and character of the use, including whether it is commercial or non-profit;
  • the nature of the copyrighted work;
  • the amount and substantiality of the portion used;
  • the effect of the use on the potential market for, or value of, the original work.

There is no fixed rule. Each case is decided on its own facts, and courts have considerable discretion in deciding whether a use is 'transformative' (adding new meaning or value) or merely 'derivative' (copying).

AI companies have exploited this ambiguity to justify training models on copyrighted works without explicit consent.

For example, in Warhol Foundation v. Lynn Goldsmith (2023), the court ruled that Warhol's portrait was not transformative: it was based closely on Goldsmith's original photograph and served the same commercial purpose, so it did not qualify as fair use. By contrast, in Authors Guild v. Google (2015), the court found Google's scanning of books and display of short snippets transformative: it made information about the books more accessible, while anyone who wanted to actually read a book still had to buy it.

The question remains: is training a model on copyrighted content fair use?

In Thomson Reuters v. Ross Intelligence Inc., the court said no. Ross built a legal AI tool using Westlaw's editorial content without a license. Even though the AI did not reproduce that content directly, the court held the use was commercial, non-transformative, and harmful to Westlaw's market.

This case signals that courts may not allow companies to freely use copyrighted material for AI training, especially when the resulting product competes with the original creator's business.

In the European Union, copyright law is similar but somewhat more structured. The Copyright Directive includes two exceptions for text and data mining (TDM):

  • Article 3 permits TDM for scientific research by research organisations and cultural heritage institutions;
  • Article 4 permits TDM by anyone, including commercial developers, but only where rights holders have not expressly reserved their rights (opted out), for example through machine-readable means.

That opt-out clause is important. It gives content owners some power to say no, and it constrains how freely commercial developers can train on web content.

While the EU's rules are more transparent than U.S. law, both systems show how unprepared current copyright frameworks are for generative AI's scale and speed.

Governance Gaps

Even where laws offer some protection, like the EU's opt-outs or U.S. fair use limits, the systems needed to actually enforce or respect those protections are often missing.

The EU AI Act calls for transparency: companies must report how they collect training data and where it comes from, which should prevent unauthorised data, such as personal or copyrighted material, from being used for training. But this reporting is self-declared, and no system exists to verify it. Similarly, while Directive (EU) 2019/790 allows rights holders to opt out of text and data mining, it does not define how that opt-out should be registered, detected, or enforced.
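
One emerging answer is the TDM Reservation Protocol (TDMRep), a W3C Community Group specification that lets publishers express the Article 4 opt-out in machine-readable form, for instance as a JSON file served at /.well-known/tdmrep.json. The sketch below, for a hypothetical example.com, shows the general shape of such a declaration; treat the exact fields as an assumption based on the draft specification rather than settled practice:

    [
      {
        "location": "/",
        "tdm-reservation": 1,
        "tdm-policy": "https://example.com/tdm-policy.json"
      }
    ]

Here "tdm-reservation": 1 reserves TDM rights for everything under the site root, and the optional policy URL points to licensing terms. Without a legal mandate or enforcement behind it, though, crawlers remain free to ignore the file.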

In the UK, the Intellectual Property Office (IPO) had to withdraw a proposal for a broad text and data mining exception after criticism that it was misguided and harmful to the creative industries. Even lawmakers, in other words, are not yet sure how to govern this.

A recent paper, Open Problems in Technical AI Governance, points to a deeper issue: the infrastructure to verify what data a model was trained on, where it came from, and whether it was licensed is still being built.

Even Creative Commons, a supporter of open access, acknowledged these governance gaps in its response to the U.S. Copyright Office. While it argued that AI training can fall under fair use, it also stressed that copyright law alone cannot handle the scale of generative AI. That is the bigger issue: what's missing isn't just legal clarity but the technical and policy framework that gives creators visibility, choice, and enforcement power. Right now, there is no standard way to express opt-outs, no requirement to document datasets, and no tools to check whether a creator's content was used in training.

Emerging Technical Tools to Protect Data and Verify AI Training

Before discussing new policies and governance, it's worth noting some early technical solutions that help protect creators' rights and improve transparency.

Data provenance systems such as W3C's PROV standard track where data comes from and who owns it, which helps with dataset verification. Creators can also label content with metadata (e.g., the IPTC/PLUS "Data Mining" property) to signal what uses they permit. A simplified sketch of the provenance idea follows below.
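
As a rough illustration, the Python sketch below records a PROV-style provenance chain for a single training example: the original work (what PROV calls an entity), the scrape that collected it (an activity), and the link between the source and the dataset copy (a derivation). The record structure and field names are simplified stand-ins for the real PROV vocabulary, not its official serialization:

    from dataclasses import dataclass
    from datetime import datetime, timezone

    @dataclass
    class ProvenanceRecord:
        """Simplified PROV-style record: entity, activity, derivation."""
        source_url: str    # the original work (PROV 'entity')
        collected_by: str  # the crawler run that fetched it (PROV 'activity')
        collected_at: str  # timestamp of collection
        licence: str       # declared licence, if any was found
        derived_id: str    # ID of the derived copy inside the training set

    def record_scrape(url: str, crawler: str, licence: str,
                      derived_id: str) -> ProvenanceRecord:
        """Attach provenance to each item as it enters the dataset."""
        return ProvenanceRecord(
            source_url=url,
            collected_by=crawler,
            collected_at=datetime.now(timezone.utc).isoformat(),
            licence=licence,
            derived_id=derived_id,
        )

    # Usage: every dataset item carries an auditable origin trail.
    rec = record_scrape("https://example.com/photo.jpg", "ccbot-2024-10",
                        "unknown", "dataset-item-0001")
    print(rec)

If every item in a training set carried a record like this, auditors could check licences and opt-outs after the fact instead of taking a lab's word for it.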

Tools like the Glaze Project apply subtle perturbations to artworks, making it harder for AI models to mimic an artist's style.

Researchers have proposed "proof-of-training-data" methods that let a developer demonstrate which data trained a model, though these require deep access to the training process. Conversely, membership inference attacks can detect whether a specific item was included in a model's training data, which can support external audits; a minimal sketch follows below.
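
Here is a minimal sketch of the simplest membership inference signal, a loss threshold: models tend to assign lower loss to examples they were trained on, so an auditor can compare a candidate item's loss against losses on data the model is known not to have seen. The interface and threshold here are illustrative assumptions; real audits use far more careful statistics:

    import numpy as np

    def membership_score(loss_candidate: float,
                         losses_unseen: np.ndarray) -> float:
        """Fraction of known-unseen examples with HIGHER loss than the
        candidate. Values near 1.0 suggest the candidate was memorised."""
        return float(np.mean(losses_unseen > loss_candidate))

    def infer_membership(loss_candidate: float, losses_unseen: np.ndarray,
                         threshold: float = 0.95) -> bool:
        """Flag the candidate as a probable training-set member if its loss
        is lower than almost all known non-member losses."""
        return membership_score(loss_candidate, losses_unseen) >= threshold

    # Usage with made-up numbers: the model is suspiciously confident
    # (low loss) on the candidate compared to 1,000 examples it never saw.
    rng = np.random.default_rng(0)
    losses_unseen = rng.normal(loc=3.2, scale=0.5, size=1000)
    print(infer_membership(loss_candidate=1.1,
                           losses_unseen=losses_unseen))  # True

A creator suspecting their work was used could, in principle, ask an auditor to run exactly this kind of test, which is why such methods pair naturally with the documentation and audit requirements discussed next.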

These tools are promising, but they need policy support and industry cooperation to work at scale.

How We Could Govern AI Training Data

Governing AI training data effectively requires a combination of technical solutions and clear policy frameworks, ensuring transparency, respect for creators' rights, and accountability.

Concrete Policy Proposals

Stakeholders

Implementation Challenges and Trade-Offs

International Coordination

Conclusion

The debate over AI training data goes beyond copyright law. It's about consent, transparency, and building systems that govern technology responsibly.

The lawsuits make one thing clear: scraping the internet for data without consent or oversight is no longer sustainable or ethical. We need real governance, both technical and institutional. That means better documentation. Clear opt-outs. Audits that actually happen. And rules that protect not just AI companies, but also the artists, writers, and users whose work powers these models.

Sources

  • Getty Images v. Stability AI lawsuit
  • The New York Times v. OpenAI and Microsoft
  • Sarah Silverman et al. v. OpenAI and Meta
  • Warhol Foundation v. Lynn Goldsmith (2023)
  • Authors Guild v. Google (2015)
  • Thomson Reuters v. Ross Intelligence Inc.
  • EU AI Act and Directive (EU) 2019/790
  • Open Problems in Technical AI Governance (2024)
  • Creative Commons submission to the U.S. Copyright Office
  • Glaze and Nightshade Projects
  • Choi et al. (2023), "Proof-of-Training Data"
  • Membership Inference Attacks (various papers)

Note: Edited with the help of AI tools. All research and writing decisions are my own.
