Unbelievable Scale of AI’s Pirated-Books Problem

Exactly. Along with all the other “AI” companies who stole their training data. The eventual lawsuits, even if they lose, will amount to a drop in the bucket compared to their profits. This is why I’m glad for DeepSeek and the other Chinese open-source “AI” programs that will just keep stealing the code from Western companies and releasing it as open source, thus completely undermining the Western companies’ ability to profit off creatives’ stolen IP. It does nothing to get the creatives the justice they deserve, but at least it’ll damage or destroy a few of the parasitic companies trying to profit off global-scale IP theft.
I'm doubtful about DeepSeek; I've heard rumors there's a lot of smoke and mirrors involved. In short, it wasn't as cheap to train as originally advertised, nor is it as advanced.
 

This is why I’m glad for DeepSeek and the other Chinese open-source “AI” programs that will just keep stealing the code from Western companies and releasing it as open source, thus completely undermining the Western companies’ ability to profit off creatives’ stolen IP.

I have not looked into DeepSeek, but, as a general note - for generative AI, the source code does not reflect the trained state of the system in operation. I suppose you can hand someone the full data for a trained system, but the bulk of it isn't "code" per se.
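To make that distinction concrete, here's a minimal sketch assuming a PyTorch-style setup (the model and file names are illustrative, not any company's actual code):

import torch
import torch.nn as nn

# The "source code": a tiny model definition, a few lines of Python.
model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 50000))

# The "trained state": a separate blob of learned weights. For a real LLM
# this file runs to tens or hundreds of gigabytes, and it is what the
# training data actually shaped. Publishing the code above reveals none of it.
torch.save(model.state_dict(), "weights.pt")       # export trained weights
model.load_state_dict(torch.load("weights.pt"))    # restore trained state

Open-sourcing the few lines of model code tells you almost nothing; it's the weights file that embodies the training.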
 

In lieu of $impossible fines, the outcome that is at least reasonably possible would be the corporate death penalty for Meta with all of its intellectual property assets being forfeited to the public domain. Possible? Sure. Likely? No.

Another outcome that is reasonably possible is:
1) They must stop using the trained form of the generative AI.
2) They may keep the underlying code.
3) Rather than pay fines, they must create a new training data set and retrain - and they pay for all the bits of it this time, keeping an audit trail to prove it.

If publishers and authors/creators are smart, they'll license using something like a royalty structure, creating a long-term revenue stream for creators and publishers for as long as genAIs trained on their work are in operation.
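A toy sketch of what one entry in such an audit trail / royalty ledger might look like (the field names are my invention, purely illustrative):

from dataclasses import dataclass
from datetime import date

@dataclass
class LicensedWorkRecord:
    # Provenance: enough to prove the work was acquired legitimately.
    title: str
    rights_holder: str
    license_id: str        # reference to the signed license agreement
    acquired_on: date
    # Royalty terms: owed for as long as models trained on it operate.
    royalty_per_year: float

def annual_royalties(ledger: list[LicensedWorkRecord]) -> float:
    # Total owed to creators this year while the trained model is in use.
    return sum(rec.royalty_per_year for rec in ledger)

ledger = [LicensedWorkRecord("Example Novel", "A. Author", "LIC-0001",
                             date(2025, 1, 15), 25.0)]
print(annual_royalties(ledger))  # 25.0

The point is that provable provenance plus recurring royalties is a bookkeeping problem, not a technical impossibility.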
 

And while perhaps Meta should be required to pay a fine to the rights holders for every piece of material they accessed illegally, even locating all of the rights holders would be an impossible task, much less compensating them. Even if we can identify the author of each piece of copyrighted material they infringed upon, we have to track down each author, and if copyright was transferred (such as in the case of a work for hire), we have to figure out to whom it was transferred, including cases where it may have been transferred multiple times.
There's no "we"--we don't have to do anything. We put the burden on Meta. Make them pay into a fund which copyright holders can claim from. Logistical issues can be solved; it's whether enforcement will take place that's the problem.

Alternatively, if it costs a ton to pay lawyers to sift through all that and do all that arduous work so that copyright holders can be paid proactively? Again, we put the burden on Meta. It can be done, it's just an expensive task. But it should be their expensive task. That's not even a 'punishment'--it's just making them do what they should have done in the first place for each and every book that they downloaded. The fact that the scale of the task is immense is a mountain of their own making, and is their problem, not ours.

Anyhow. The lawsuits will be incoming. And in those 7.5M books, I bet there's a lot of IP owned by some big players. I haven't checked, but if there are Star Wars novels in there or content owned by other massive corporations, you can bet they'll be stepping up with big lawyers and deep pockets.
 

This is going to be an unpopular opinion:

Training AI on existing works isn't piracy in any reasonable definition of the term. It isn't necessarily ethical, but it isn't piracy. By intentionally using an incongruous term and trying to shoehorn it into your argument, you actually weaken it.

More simply: if I can't ask ChatGPT to replicate the PHB, it isn't piracy.
It's obviously and unarguably piracy in this case. They literally used pirated content. Pretending otherwise is a ridiculous and obviously wrong opinion rather than an "unpopular opinion".

And just because ChatGPT can't replicate something doesn't mean it's not software that was created via piracy, it simply points to limitations and peculiarities in how ChatGPT is designed.

I strongly suggest you RTA in this case too - the people involved were very conscious that they were committing piracy, and did so by choice rather than licensing the works.
 

And in mission-critical operations, AI assistance is generally worse than not having it, because humans tend to become reliant on AI assistance and assume it is correct if it has a success rate of around 85-90% or better, meaning they miss the errors it generates and let them through. This is true of everything from software developers using generated code to autopilot systems (both for cars and aircraft) to people trusting what ChatGPT tells them. It's a long-known phenomenon. It's why Teslas (Cybertruck excluded) have better safety systems than many other cars yet have a crash rate higher than the Pinto. People trust the self-drive even though it's not a trustworthy system.
That can be overcome by the business. I know in my particular case, we have extensive training for everyone in the company (not just the technical people) on the ethical use of AI, the use of AI in the business, and the regulations around that use - and use outside of those bounds is a punishable offense, up to and including termination. I can't imagine we're the only ones with those guardrails in place.
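For a rough sense of the arithmetic behind the reliance effect described above (all numbers are hypothetical, just to show the shape of the problem):

# Hypothetical illustration: an assistant that is wrong 10% of the time,
# reviewed by a human whose vigilance drops as the tool "earns" trust.
ai_error_rate = 0.10

attentive_catch_rate = 0.95   # reviewer who still checks everything
complacent_catch_rate = 0.40  # reviewer who has learned to trust the tool

for catch in (attentive_catch_rate, complacent_catch_rate):
    shipped_errors = ai_error_rate * (1 - catch)
    print(f"catch rate {catch:.0%} -> {shipped_errors:.1%} of outputs ship with errors")

# catch rate 95% -> 0.5% of outputs ship with errors
# catch rate 40% -> 6.0% of outputs ship with errors

Under these made-up numbers, a modest drop in reviewer vigilance multiplies the shipped error rate by an order of magnitude, which is the crux of the complacency problem.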
 

Again, an LLM itself does not contain copyrighted material. It is basically a collection of facts ABOUT the copyrighted material (mostly 'what word probably follows this other word'), but it is not the copyrighted material itself. Placing the LLM into the public domain would not give access to the copyrighted material used to build it.
This depends. Subsequent training and improvement depend on provenance, which can include the sources. That's how you get incremental improvement and how you're able to trace back erroneous introductions to the model.
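A toy illustration of the "facts about the material" point: a bigram counter (vastly simpler than a real LLM, but the same basic idea) stores word-transition statistics rather than a verbatim copy of the text:

from collections import Counter, defaultdict

def train_bigrams(text: str) -> dict:
    # Store counts of "which word follows which" -- statistics about the
    # text, not a reproduction of it.
    counts = defaultdict(Counter)
    words = text.split()
    for a, b in zip(words, words[1:]):
        counts[a][b] += 1
    return counts

model = train_bigrams("the cat sat on the mat the cat ran")
print(model["the"])  # Counter({'cat': 2, 'mat': 1})

What gets saved is the table of counts, not the sentence, which is why publishing the model is not the same thing as publishing the training texts - though, as noted, provenance records for those texts still matter for retraining and debugging.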
 


Right, but those are still individuals. Now, Napster wasn't, but how's Napster these days?
Been following this thread, but here is a total aside. I was reminded of your question about Napster when I heard about this today:

Edited to add: Not condoning Napster's actions/conduct/business model. Just odd news that came up today.
 

