Data Provenance for AI: Why It Matters and How to Fix It
Deep analysis of broken data provenance in AI development, based on landmark ICML 2024 research. Understanding consent, authenticity, and accountability in AI training data.
The development of modern AI systems rests on a foundation of data whose origins, permissions, and provenance are often unknown, unclear, or actively problematic. Landmark research published at ICML 2024 by Longpre, Mahari, and colleagues (arXiv:2404.12691) systematically documented how data authenticity, consent, and provenance practices in AI development are fundamentally broken. This article explores their findings and what they mean for human data rights.
The Provenance Crisis
What Is Data Provenance?
Data provenance encompasses:
Origin Information:
- Where did the data come from?
- Who created it?
- When was it created?
- What was its original purpose?
Permission Chain:
- Who authorized its collection?
- Under what terms was consent given?
- Has that consent been honored?
- What rights do original creators retain?
Transformation History:
- How has the data been processed?
- What modifications have been made?
- How has it been combined with other data?
- What has been derived from it?
Usage Tracking:
- How has the data been used?
- In what systems is it present?
- What decisions has it influenced?
- What outputs have been generated?
Why Provenance Matters
Provenance is essential for:
Legal Compliance:
- Copyright requires tracking original ownership
- Privacy law requires consent documentation
- Licensing requires terms adherence
- Litigation requires evidence of proper acquisition
Ethical Accountability:
- Respecting creator intentions
- Honoring consent boundaries
- Ensuring fair compensation
- Maintaining trust relationships
Technical Quality:
- Understanding data characteristics
- Identifying potential biases
- Ensuring representativeness
- Enabling error correction
Governance:
- Auditing data practices
- Enforcing policies
- Meeting regulatory requirements
- Demonstrating compliance
The Research Findings
Systematic Failures
The Longpre et al. research (arXiv:2404.12691) documented pervasive failures across the AI data pipeline:
Consent Failures:
- Data collected under one consent framework is routinely repurposed
- Terms of service changes retroactively alter usage rights
- Consent obtained from platforms, not original creators
- Opt-out mechanisms are ineffective or absent
Authenticity Failures:
- Training data contains significant synthetic content
- Attribution is lost through aggregation
- Fake, manipulated, or misleading data enters datasets
- Quality verification is minimal or absent
Provenance Failures:
- Origin documentation is incomplete or nonexistent
- Chain of custody is broken
- Licensing terms are unclear or contradictory
- Historical context is lost
Case Studies from the Research
Web Scraping:
- Robots.txt directives are inconsistently respected
- Terms of service prohibitions are ignored
- Copyright notices are stripped from data
- No mechanism exists for creator notification
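The robots.txt mechanism mentioned above can be checked programmatically with Python's standard library. A minimal sketch, assuming a hypothetical site whose robots.txt disallows an AI training crawler (the "GPTBot" user agent is used here purely as an illustration):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks a named AI training crawler
# while allowing all other user agents.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A compliant collector checks the directive before fetching.
print(rp.can_fetch("GPTBot", "https://example.com/article"))    # False
print(rp.can_fetch("SearchBot", "https://example.com/article")) # True
```

The research's point is that this check is trivial to perform and yet inconsistently applied in practice.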
Dataset Aggregation:
- Original licenses are lost in aggregation
- Incompatible licenses are mixed
- Attribution requirements are ignored
- Derivative work restrictions are violated
Third-Party Data:
- Data purchased from aggregators has unknown provenance
- Verification of consent claims is rare
- Liability is poorly defined
- Due diligence is minimal
User-Generated Content:
- Platform terms claim broad licensing rights
- Original creators have limited awareness
- Consent is buried in lengthy terms of service
- Opt-out after the fact is often impossible
Quantitative Findings
The research quantified the scope of problems:
- 40%+ of popular datasets have unverified or unclear licensing
- Significant portions of training data violate stated collection policies
- Majority of creators are unaware their work is used for AI training
- Opt-out mechanisms, where they exist, reach less than 5% of affected individuals
Impact on Rights
Creator Rights
Creators of content used in AI training face:
Loss of Control:
- Work used without knowledge or consent
- Unable to prevent unwanted uses
- No mechanism to withdraw consent
- Derivative works created without authorization
Economic Harm:
- AI systems compete with creators using their own work
- No compensation for training data contribution
- Difficulty proving infringement
- Power asymmetry in negotiations
Attribution Loss:
- Work is unattributed in training data
- Original context is stripped
- Collective contribution obscures individual work
- Credit goes to AI companies, not creators
Individual Privacy
Individuals whose data appears in training sets experience:
Privacy Violations:
- Personal information in training data
- Private communications exposed
- Behavioral data tracked
- Sensitive information revealed
Autonomy Loss:
- No meaningful consent
- Unable to access or correct data
- Cannot control use in AI decisions
- Subject to AI outputs based on their data
Discrimination Risks:
- Historical biases encoded in data
- Stereotypes reinforced by training
- Unfair treatment based on group data
- Limited recourse for harms
Why Fixing Provenance Is Hard
Technical Challenges
Scale:
- Trillions of data points in modern training
- Billions of potential rights holders
- Millions of potential consent interactions
- Computational overhead of tracking
Aggregation:
- Data combined from many sources
- Original boundaries lost
- Licensing incompatibility hidden
- Attribution diluted
Transformation:
- Processing obscures origins
- Preprocessing modifies data
- Embedding loses direct mapping
- Model weights abstract from data
Verification:
- Consent claims difficult to verify
- Documentation often incomplete
- Historical records unavailable
- Automated verification limited
Economic Challenges
Cost:
- Provenance tracking adds infrastructure cost
- Licensing negotiation is expensive
- Compensation distribution has overhead
- Compliance is an ongoing expense
Incentives:
- Current practices externalize costs to creators
- First-mover advantage rewards speed over diligence
- Regulatory arbitrage available
- Enforcement is limited
Market Structure:
- Few large players dominate
- Data network effects create barriers
- Concentration reduces competition
- Power asymmetry limits negotiation
Legal Challenges
Jurisdictional Complexity:
- Different rules in different countries
- Internet data crosses boundaries
- Enforcement mechanisms vary
- International coordination limited
Uncertain Doctrine:
- Fair use arguments contested
- Copyright scope for AI unclear
- Contract enforcement limited
- Evolving regulatory landscape
Remedies:
- Damages hard to calculate
- Injunctions impractical after training
- Class action challenges
- Individual action cost-prohibitive
Solutions and Standards
Technical Approaches
Data Documentation Standards:
- Datasheets for Datasets: Structured documentation of data characteristics
- Model Cards: Documentation of model training and intended use
- Data Nutrition Labels: Accessible summaries of key data properties
- FAIR Principles: Findable, Accessible, Interoperable, Reusable
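The documentation standards above can be made machine-readable. A minimal sketch of a datasheet-style record, loosely inspired by the Datasheets for Datasets proposal; the field names here are illustrative assumptions, not a formal schema:

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class Datasheet:
    """Illustrative datasheet record; fields are inspired by
    Datasheets for Datasets but are not a standard schema."""
    name: str
    motivation: str            # why the dataset was created
    collection_process: str    # how instances were acquired
    consent_basis: str         # legal/ethical basis for collection
    licenses: list = field(default_factory=list)
    known_gaps: list = field(default_factory=list)  # missing provenance

sheet = Datasheet(
    name="example-web-corpus",
    motivation="Language model pretraining",
    collection_process="Web crawl, 2023-2024",
    consent_basis="robots.txt honored; no per-creator consent",
    licenses=["CC-BY-4.0", "unknown"],
    known_gaps=["12% of documents lack source URLs"],
)

print(json.dumps(asdict(sheet), indent=2))
```

Note that the record documents uncertainty ("unknown" licenses, known gaps) explicitly rather than omitting it, which is the core of what the research finds missing in current practice.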
Provenance Tracking:
- Content authenticity initiatives: Adobe CAI, C2PA for media provenance
- Blockchain-based tracking: Immutable record of data transactions
- Cryptographic signatures: Verification of data origin and integrity
- Watermarking: Embedded identification surviving transformation
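The tamper-evidence idea behind the blockchain-based approaches above can be sketched with a simple hash chain: each provenance record is hashed together with its predecessor's hash, so any later edit invalidates every subsequent link. All names here are illustrative:

```python
import hashlib
import json

def record_hash(record: dict, prev_hash: str) -> str:
    """Hash a provenance record together with the previous entry's
    hash, chaining the log together."""
    payload = json.dumps(record, sort_keys=True) + prev_hash
    return hashlib.sha256(payload.encode()).hexdigest()

def append(log: list, record: dict) -> None:
    prev = log[-1]["hash"] if log else "0" * 64
    log.append({"record": record, "hash": record_hash(record, prev)})

def verify(log: list) -> bool:
    prev = "0" * 64
    for entry in log:
        if entry["hash"] != record_hash(entry["record"], prev):
            return False
        prev = entry["hash"]
    return True

log = []
append(log, {"source": "example.com/post/1", "license": "CC-BY-4.0"})
append(log, {"op": "dedup", "tool": "v0.1"})
assert verify(log)

# Tampering with an earlier record invalidates the whole chain.
log[0]["record"]["license"] = "public-domain"
assert not verify(log)
```

A real system would add signatures and distributed storage, but even this sketch shows why chained records make after-the-fact license rewriting detectable.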
Consent Management:
- Machine-readable consent: Standardized permission formats
- Preference expression: Do Not Train registries and protocols
- Consent infrastructure: Platforms for managing AI training permissions
- Revocation mechanisms: Systems for withdrawing previously granted consent
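A machine-readable consent record with revocation could look like the following sketch. The field names and permitted-use vocabulary are assumptions, since no single standard exists yet:

```python
from dataclasses import dataclass, field

@dataclass
class ConsentRecord:
    """Illustrative machine-readable consent entry; field names and
    the permitted-use vocabulary are assumptions, not a standard."""
    subject: str
    permitted_uses: set = field(default_factory=set)
    revoked: bool = False

    def allows(self, use: str) -> bool:
        # Uses never granted are denied by default.
        return not self.revoked and use in self.permitted_uses

    def revoke(self) -> None:
        self.revoked = True

consent = ConsentRecord(subject="creator-123",
                        permitted_uses={"search-index"})
print(consent.allows("ai-training"))   # False: never granted
print(consent.allows("search-index"))  # True

consent.revoke()
print(consent.allows("search-index"))  # False after revocation
```

The design choice worth noting is deny-by-default: repurposing data for AI training would require an explicit grant, reversing the current practice of assuming permission unless someone objects.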
Legal and Policy Approaches
Regulatory Requirements:
- EU AI Act mandates training data documentation
- Proposed US legislation requires transparency
- Copyright registration for AI training data
- Licensing framework development
Industry Standards:
- AI company transparency reports
- Voluntary disclosure frameworks
- Third-party audits
- Certification programs
Legal Frameworks:
- Copyright reform for AI context
- Privacy law extension to training data
- Collective rights management
- International harmonization
Research on generative AI and copyright (arXiv:2502.15858) documents the evolving legal landscape around these issues.
Best Practices for Organizations
Before Collection:
- Establish clear data collection policies
- Document intended uses and constraints
- Verify legal basis for acquisition
- Plan for consent management
During Collection:
- Implement provenance tracking from start
- Obtain explicit consent where required
- Maintain comprehensive records
- Respect robots.txt and terms of service
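The "track from the start" principle above can be as simple as wrapping every collected item with provenance metadata at the moment of acquisition. A minimal sketch with illustrative field names:

```python
from datetime import datetime, timezone
from typing import Optional

def collect(url: str, text: str, license_tag: Optional[str]) -> dict:
    """Attach provenance metadata to a collected item at acquisition
    time (field names are illustrative, not a standard)."""
    return {
        "content": text,
        "provenance": {
            "source_url": url,
            "collected_at": datetime.now(timezone.utc).isoformat(),
            # Record uncertainty explicitly instead of dropping it.
            "license": license_tag or "unknown",
        },
    }

item = collect("https://example.com/post/1", "sample text", None)
print(item["provenance"]["license"])  # "unknown"
```

Capturing this at collection time is cheap; reconstructing it later, as the research documents, is often impossible.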
After Collection:
- Regular provenance audits
- Respond to access and deletion requests
- Update records as uses change
- Prepare for regulatory scrutiny
For Model Development:
- Document training data thoroughly
- Test for problematic data influence
- Enable data removal or unlearning
- Maintain audit trails
What Individuals Can Do
Understand Your Rights
Current Rights:
- Access requests under GDPR/CCPA for personal data
- Copyright claims for creative works
- Contract claims for terms of service violations
- Privacy claims for unauthorized data use
Emerging Rights:
- Right to explanation of AI decisions
- Right to object to AI training
- Right to fair compensation
- Right to meaningful consent
Protect Your Data
Proactive Measures:
- Review terms of service before using platforms
- Use opt-out mechanisms where available
- Add Do Not Train directives where supported
- Register copyrights for valuable works
Documentation:
- Keep records of your creations
- Document publication dates
- Track where your content appears
- Save evidence of unauthorized use
Support Systemic Change
Advocacy:
- Support data rights organizations
- Engage with policy development
- Participate in public comments
- Contact representatives
Collective Action:
- Join creator organizations
- Support litigation funds
- Participate in class actions
- Build coalitions
The Path Forward
Near-Term (2026-2027)
Standards Development:
- Industry groups developing provenance standards
- Technical specifications maturing
- Pilot implementations launching
- Best practices emerging
Regulatory Implementation:
- EU AI Act requirements taking effect
- State laws, such as Colorado's AI Act, taking effect
- Enforcement actions establishing precedent
- Guidance documents clarifying expectations
Medium-Term (2027-2030)
Infrastructure Building:
- Consent management platforms maturing
- Provenance tracking becoming standard
- Licensing frameworks operating
- Compensation systems functioning
Market Evolution:
- Ethically-sourced data as differentiator
- Premium for clear provenance
- Consumer awareness increasing
- Creator organizations strengthening
Long-Term (2030+)
Comprehensive Framework:
- Global provenance standards
- Effective consent infrastructure
- Fair compensation systems
- Transparent AI development
Frequently Asked Questions
Q: How do I know if my data was used to train an AI?
A: Currently, this is very difficult to determine. AI companies rarely disclose specific training data. Some researchers have developed “membership inference” techniques, but these are imperfect. Stronger transparency requirements are needed.
Q: Can I opt out of future AI training?
A: Increasingly, yes. Many AI companies offer opt-out mechanisms, and standards like Do Not Train are emerging. However, data already used in training cannot typically be fully removed.
Q: What makes data provenance “broken”?
A: The Longpre et al. research found that consent is often absent or invalid, origin documentation is incomplete, licensing is unclear, and there’s no effective way to track data through the AI pipeline.
Q: How is this different from traditional copyright issues?
A: AI training involves unprecedented scale (trillions of data points), transformation that makes identification difficult, and collective use that obscures individual contributions. Traditional frameworks weren’t designed for this context.
Q: What should AI companies do differently?
A: Implement provenance tracking from collection through training. Obtain clear consent. Document data sources thoroughly. Provide effective opt-out mechanisms. Support compensation systems.
Conclusion
The broken state of data provenance in AI development is not an inevitable technical reality but a consequence of choices made in pursuit of rapid advancement. As the Longpre, Mahari, et al. research conclusively demonstrates, current practices fail to respect consent, track authenticity, or maintain provenance—with profound implications for creator rights, individual privacy, and accountability.
Fixing these problems requires coordinated action across technical, legal, and social domains. Standards must be developed and adopted. Laws must be enacted and enforced. Organizations must change practices. Individuals must be empowered.
The Human Data Rights Coalition views proper data provenance as foundational to all other data rights. Without knowing where data comes from and how it’s used, there can be no meaningful consent, no fair compensation, and no effective accountability. Building provenance infrastructure is thus a priority for the movement.
This analysis is based on research published in arXiv:2404.12691 and related work. For the complete technical findings, consult the original papers.
Academic Sources
- Longpre, Mahari, et al. "Data Authenticity, Consent, & Provenance for AI are all broken." ICML 2024. arXiv:2404.12691
- "Generative AI Training and Copyright Law." arXiv:2502.15858