Shaping Policy to Protect Data Rights in AI Development

Introduction

Generative AI relies on three main components, often called the "AI triad": data, compute, and algorithms. All three matter, but data is the foundation; without it, models cannot be trained at all. So where does all this data come from? And do AI labs have the right to use it?

Most AI systems are trained on massive datasets scraped from the internet: vast amounts of text, images, and video, often collected without regard for copyright or consent. This raises serious legal and ethical concerns.

Recent high-profile lawsuits show the risks of overlooking these issues. Getty Images, for example, sued Stability AI for allegedly using more than 12 million photos without a license to train its AI image-generation system, Stable Diffusion. Getty claims that, beyond the unauthorised use itself, copyright metadata was stripped and that generated images bearing Getty's watermark put its reputation at risk. Artists and other creators have taken similar legal action, and major AI companies, including Meta, OpenAI, and Microsoft, have been sued for training large language models on copyrighted material without consent.

These cases signal a broader legal and governance shift: AI companies can no longer assume that online data is free for unrestricted use. In the absence of clear regulation, mass data scraping violates creators' rights, exposes companies to legal risk, and undermines trust in AI systems.

This article argues for immediate governance of the data used to train AI models. Such governance must ensure transparency, protect data rights, and support ethical AI development.

"Copyright law is a sword that's going to hang over the heads of AI companies for several years unless they figure out how to negotiate a solution."

Daniel Gervais, co-director of the intellectual property program at Vanderbilt University

How AI Models Are Trained

Generative AI training runs on massive amounts of data, gathered by web crawlers that copy billions of pages from across the internet. Articles, images, videos: everything is downloaded and fed into huge training datasets.

One widely used web crawler is CCBot, run by the non-profit Common Crawl, which compiles massive archives of the web. These archives are in turn used to build other training sets, such as LAION, which is widely used in image-model development. OpenAI has acknowledged that most of its training data comes from Common Crawl.

However, the fact that data is publicly available, or was collected by a non-profit, does not mean it is free to use without restriction. Such data is unstructured and rarely checked for ownership or permission.

In 2023, OpenAI launched its own crawler, GPTBot, to scrape online content for AI training. The company says GPTBot avoids paywalled and private sites and lets website owners opt out using the robots.txt file. But most creators are unaware this file exists, let alone how to use it.
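
For site owners who do want to opt out, the mechanism is a few lines of plain text served at the site root (e.g. example.com/robots.txt). This minimal example uses the crawler names that OpenAI and Common Crawl publish, and assumes the owner wants to block both crawlers from the entire site:

    # robots.txt, placed at the root of the website
    # Block OpenAI's training crawler from the whole site
    User-agent: GPTBot
    Disallow: /

    # Block Common Crawl's crawler as well
    User-agent: CCBot
    Disallow: /

Note that compliance is voluntary: robots.txt is a convention, not an enforcement mechanism, which is precisely the governance gap discussed below.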

Some datasets go further. Ongoing lawsuits against OpenAI and Meta allege that ChatGPT and LLaMA were also trained on illegally obtained datasets from shadow libraries such as Z-Library and Bibliotik. Meta's research has cited The Pile, a dataset created by EleutherAI that openly acknowledges including material from these shadow libraries.

The Atlantic built a search tool revealing that more than 7.5 million books are hosted on LibGen, a pirate library. Data from LibGen has reportedly been used by Meta and other AI companies to train their models.

Companies like Anthropic have defended these practices, arguing that licensing data at such vast scale is impossible and that their training data comes from publicly available and non-profit sources.

The legality of this entire process now depends on how courts interpret copyright and data rights in the AI era.

What Copyright Law Does (and Doesn't) Cover

In the U.S., copyright law is shaped by the fair use doctrine, which permits limited use of copyrighted material without the rights holder's permission for purposes such as criticism, comment, news reporting, teaching, scholarship, and research.

Whether a use qualifies as fair depends on four statutory factors:

  • the purpose and character of the use, including whether it is commercial or non-profit;
  • the nature of the copyrighted work;
  • the amount and substantiality of the portion used;
  • the effect of the use on the potential market for, or value of, the original work.

There is no fixed rule. Each case is decided on its own facts, and courts have considerable discretion in deciding whether a use is 'transformative' (adding new meaning or value) or merely 'derivative' (copying).

AI companies have exploited this ambiguity to justify training models on copyrighted works without explicit consent.

For example, in Warhol Foundation v. Lynn Goldsmith (2023), the court ruled that Warhol's portrait was not transformative: it was based closely on Goldsmith's original photograph and served the same commercial purpose, so it did not qualify as fair use. By contrast, in Authors Guild v. Google (2015), the court found Google's scanning of books and display of short snippets transformative: it made information about the books more accessible, while anyone who wanted to actually read a book still had to buy it.

The question remains: is training a model on copyrighted content fair use?

In Thomson Reuters v. Ross Intelligence Inc., the court said no. Ross built a legal AI tool using Westlaw's editorial content without a license. Even though the AI did not reproduce that content directly, the court held the use was commercial, non-transformative, and harmful to Westlaw's market.

This case signals that courts may not allow companies to freely use copyrighted material for AI training, especially when the resulting product competes with the original creator's business.

In the European Union, copyright law is similar but somewhat more structured. The Copyright Directive includes two exceptions for text and data mining (TDM):

  • Article 3 permits TDM for scientific research by research organisations and cultural heritage institutions;
  • Article 4 permits TDM by anyone, including commercial developers, but only where rights holders have not expressly reserved their rights (opted out), for example through machine-readable means.

That opt-out clause is important. It gives content owners some power to say no, and it constrains how freely commercial developers can train on web content.

While the EU's rules are more transparent than U.S. law, both systems show how unprepared current copyright frameworks are for generative AI's scale and speed.

Governance Gaps

Even where laws offer some protection, like the EU's opt-outs or U.S. fair use limits, the systems needed to actually enforce or respect those protections are often missing.

The EU AI Act calls for transparency: companies must report how they collect training data and where it comes from, which should prevent unauthorised data, such as personal or copyrighted material, from being used for training. But this reporting is self-declared, and no system exists to verify it. Similarly, while Directive (EU) 2019/790 allows rights holders to opt out of text and data mining, it does not define how that opt-out should be registered, detected, or enforced.
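
One emerging answer is the TDM Reservation Protocol (TDMRep), a W3C Community Group specification that lets publishers express the Article 4 opt-out in machine-readable form, for instance as a JSON file served at /.well-known/tdmrep.json. The sketch below, for a hypothetical example.com, shows the general shape of such a declaration; treat the exact fields as an assumption based on the draft specification rather than settled practice:

    [
      {
        "location": "/",
        "tdm-reservation": 1,
        "tdm-policy": "https://example.com/tdm-policy.json"
      }
    ]

Here "tdm-reservation": 1 reserves TDM rights for everything under the site root, and the optional policy URL points to licensing terms. Without a legal mandate or enforcement behind it, though, crawlers remain free to ignore the file.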

In the UK, the Intellectual Property Office (IPO) had to withdraw a proposal for a broad text and data mining exception after criticism that it was misguided and harmful to the creative industries. Even lawmakers, in other words, are not yet sure how to govern this.

A recent paper, Open Problems in Technical AI Governance, points to a deeper issue: the infrastructure to verify what data a model was trained on, where it came from, and whether it was licensed is still being built.

Even Creative Commons, a supporter of open access, acknowledged these governance gaps in its response to the U.S. Copyright Office. While it argued that AI training can fall under fair use, it also stressed that copyright law alone cannot handle the scale of generative AI. That is the bigger issue: what's missing isn't just legal clarity but the technical and policy framework that gives creators visibility, choice, and enforcement power. Right now, there is no standard way to express opt-outs, no requirement to document datasets, and no tools to check whether a creator's content was used in training.

Emerging Technical Tools to Protect Data and Verify AI Training

Before discussing new policies and governance, it's worth noting some early technical solutions that help protect creators' rights and improve transparency.

Data provenance systems such as W3C's PROV standard track where data comes from and who owns it, which helps with dataset verification. Creators can also label content with metadata (e.g., the IPTC/PLUS "Data Mining" property) to signal what uses they permit. A simplified sketch of the provenance idea follows below.
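
As a rough illustration, the Python sketch below records a PROV-style provenance chain for a single training example: the original work (what PROV calls an entity), the scrape that collected it (an activity), and the link between the source and the dataset copy (a derivation). The record structure and field names are simplified stand-ins for the real PROV vocabulary, not its official serialization:

    from dataclasses import dataclass
    from datetime import datetime, timezone

    @dataclass
    class ProvenanceRecord:
        """Simplified PROV-style record: entity, activity, derivation."""
        source_url: str    # the original work (PROV 'entity')
        collected_by: str  # the crawler run that fetched it (PROV 'activity')
        collected_at: str  # timestamp of collection
        licence: str       # declared licence, if any was found
        derived_id: str    # ID of the derived copy inside the training set

    def record_scrape(url: str, crawler: str, licence: str,
                      derived_id: str) -> ProvenanceRecord:
        """Attach provenance to each item as it enters the dataset."""
        return ProvenanceRecord(
            source_url=url,
            collected_by=crawler,
            collected_at=datetime.now(timezone.utc).isoformat(),
            licence=licence,
            derived_id=derived_id,
        )

    # Usage: every dataset item carries an auditable origin trail.
    rec = record_scrape("https://example.com/photo.jpg", "ccbot-2024-10",
                        "unknown", "dataset-item-0001")
    print(rec)

If every item in a training set carried a record like this, auditors could check licences and opt-outs after the fact instead of taking a lab's word for it.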

Tools like the Glaze Project apply subtle perturbations to artworks, making it harder for AI models to mimic an artist's style.

Researchers have proposed "proof-of-training-data" methods that let a developer demonstrate which data trained a model, though these require deep access to the training process. Conversely, membership inference attacks can detect whether a specific item was included in a model's training data, which can support external audits; a minimal sketch follows below.
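
Here is a minimal sketch of the simplest membership inference signal, a loss threshold: models tend to assign lower loss to examples they were trained on, so an auditor can compare a candidate item's loss against losses on data the model is known not to have seen. The interface and threshold here are illustrative assumptions; real audits use far more careful statistics:

    import numpy as np

    def membership_score(loss_candidate: float,
                         losses_unseen: np.ndarray) -> float:
        """Fraction of known-unseen examples with HIGHER loss than the
        candidate. Values near 1.0 suggest the candidate was memorised."""
        return float(np.mean(losses_unseen > loss_candidate))

    def infer_membership(loss_candidate: float, losses_unseen: np.ndarray,
                         threshold: float = 0.95) -> bool:
        """Flag the candidate as a probable training-set member if its loss
        is lower than almost all known non-member losses."""
        return membership_score(loss_candidate, losses_unseen) >= threshold

    # Usage with made-up numbers: the model is suspiciously confident
    # (low loss) on the candidate compared to 1,000 examples it never saw.
    rng = np.random.default_rng(0)
    losses_unseen = rng.normal(loc=3.2, scale=0.5, size=1000)
    print(infer_membership(loss_candidate=1.1,
                           losses_unseen=losses_unseen))  # True

A creator suspecting their work was used could, in principle, ask an auditor to run exactly this kind of test, which is why such methods pair naturally with the documentation and audit requirements discussed next.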

These tools are promising, but they need policy support and industry cooperation to work at scale.

How We Could Govern AI Training Data

Governing AI training data effectively requires a combination of technical solutions and clear policy frameworks, ensuring transparency, respect for creators' rights, and accountability.

Concrete Policy Proposals

Stakeholders

Implementation Challenges and Trade-Offs

International Coordination

Conclusion

The debate over AI training data goes beyond copyright law. It's about consent, transparency, and building systems that govern technology responsibly.

The lawsuits make one thing clear: scraping the internet for data without consent or oversight is no longer sustainable or ethical. We need real governance, both technical and institutional. That means better documentation. Clear opt-outs. Audits that actually happen. And rules that protect not just AI companies, but also the artists, writers, and users whose work powers these models.

Sources

  • Getty Images v. Stability AI lawsuit
  • The New York Times v. OpenAI and Microsoft
  • Sarah Silverman et al. v. OpenAI and Meta
  • Warhol Foundation v. Lynn Goldsmith (2023)
  • Authors Guild v. Google (2015)
  • Thomson Reuters v. Ross Intelligence Inc.
  • EU AI Act and Directive (EU) 2019/790
  • Open Problems in Technical AI Governance (2024)
  • Creative Commons submission to the U.S. Copyright Office
  • Glaze and Nightshade Projects
  • Choi et al. (2023), "Proof-of-Training Data"
  • Membership Inference Attacks (various papers)

Note: Edited with the help of AI tools. All research and writing decisions are my own.
