Unbelievable Scale of AI’s Pirated-Books Problem

And that's the point I'm making. You're generalizing that all AI and all GenAI will disappear based on a use case that isn't the one many businesses actually use it for. It's perfectly useful for other business applications that give a bigger lift than creating a video or an image.

If you'd said something like "the use of GenAI to wholesale create art or video might be cooling off soon because of its limitations," that would be a much narrower claim than "all GenAI will reach a cooling-off period," and it's one I could possibly see.

But just the amount of time it saves in a lot of use cases is a reason that it's nowhere near a cooling-off period in business applications, and that's just touching the tip of the iceberg.
My main issue is that I see "AI" being used in cases where it's not really a good solution, simply because a business can generate interest by including "AI". It's being used as marketing, either toward the end user or toward other executives who don't know better: we'll just slap some AI on it and look at us, we're futuristic!

And in mission-critical operations, AI assistance is generally worse than not having it, because humans tend to become reliant on the assistance and assume it is correct once it has a success rate of around 85-90% or better, meaning they miss the errors it generates and let them through. This is true of everything from software developers using generated code to autopilot systems (both for cars and aircraft) to people trusting what ChatGPT tells them. It's a long-known phenomenon. It's why Teslas (Cybertruck excluded) have better safety systems than many other cars yet have a crash rate higher than the Pinto: people trust the self-driving even though it's not a trustworthy system.

So the die-down I'm talking about is a scaling back of how applications and services utilize AI, where executives got over-eager and are suddenly finding out that it won't deliver quite as they expected. I'm not saying there aren't totally valid use cases. I'm saying executives caught wind of another tech marvel coming out of the Valley, got over-eager about it, and now they're out past the end of their skis. There are going to be some absolutely perfect uses for it, no doubt at all. But the current state of "AI" is evolutionary, not revolutionary.
 


(I removed an image that was found by googling Tony Stark as Ironman)
An AI-generated image. It features copyrighted material, including Ironman and Tony Stark, which are owned by Marvel. It includes an unlicensed likeness of Robert Downey Jr. It was created using unlicensed visual assets that were tagged by human workers as including "Robert Downey Jr.", "Ironman", and "Tony Stark."
 

How many of them did you illegally torrent?
C'mon Morrus, you know that's a "do you still beat your wife?" sort of question.

Perhaps I should clarify: when I made the statement above, every one of them was a physical book (I grew up before e-books existed - I know, I'm old!). I had a deep and abiding love affair with my local public libraries from the time I was about 4 years old (when I was 5 years old I got an award for reading 400 books during the school year... and it wasn't stuff like illustrated 32-page "The Berenstain Bears" - it was all-text 200+ page stuff like "Little House on the Prairie"). Generally I went through at least one backpack full of physical library books per week from the ages of about 8 through 16. Figure 20 books per week times 50 weeks times 9 years... 9,000 books at least. I literally read every (physical) book at my school libraries and almost every book at my local public library in that time period.

I've probably read much more than that if I count e-books. I do respect the creative process and will try to pay for e-books whenever I can. My DriveThruRPG.com library currently lists 1667 items (all paid for, aside from a few free ones, I suppose). I have a lot of PDFs from the old "ESR" service WotC did before they moved their works to DTRPG... again all paid for (except for the few WotC made available as free downloads, but as the copyright holder, they are entitled to make them available for free and I am within my rights to download copies from them).

That said, as far as torrenting books, I am of the opinion that if I own a physical copy of a book, it is not unethical (though perhaps, strictly speaking, illegal) to download a copy of that same book. That's of course a completely separate discussion and probably not fit for this thread. But in general, I don't consider myself entitled to "steal" something when I can buy it... and if I can't afford it, I'll do without.
 


No, it’s not.

It’s a rhetorical question which points out that you legally reading books you own is not the same as Meta illegally pirating them. Meta is breaking the law. You (presumably) were not.
For my part, it's shocking that a corporation of that size decided to torrent files like that and its legal department didn't immediately and collectively croak on the spot from sheer horror. Then there's the scale of what they did: they didn't do just a little piracy, they went for industrial-scale piracy.
 

Hmmm...

[Attachment: Screenshot 2025-03-21 at 5.40.03 PM.jpg]
 

No, it’s not.

It’s a rhetorical question which points out that you legally reading books you own is not the same as Meta illegally pirating them. Meta is breaking the law. You (presumably) were not.
Ah, I didn't see that's where you were going with this. That makes it a fairer question.

I did find an interesting article on the subject here: On Copyright, “Facts,” & Generative AI. I'll highlight a couple of excerpts here:

Google Books, for example, was held to be a “fair use” that did not infringe copyright, in part because “the purpose of Google’s copying of the original copyrighted books is to make available significant information about those books,” such as where or how frequently a given word appears in a text.
...
courts have also held technologies like image search and plagiarism-detection software to be non-infringing even though they, like Google Books, entail large-scale reproductions of copyrighted materials.

I'm not familiar with the specifics of the Google Books case above, nor of the Meta case you are referencing. I suppose it's possible Google purchased a copy of each book it scanned for Google Books (though I have to admit I am doubtful) while Meta torrented all of the books it is utilizing. But given the above, I have to wonder whether the law would find that Meta's use is substantially similar to Google's and therefore not infringing (and thus not "illegally pirating them").

Of course, I may ALSO be of the opinion that on some or all of the above points, “If the law supposes that,” said Mr. Bumble,… “the law is a ass—a idiot." (CHARLES DICKENS, Oliver Twist, chapter 51) but that's neither here nor there. I am not a lawyer, this is not legal advice, etc.
 

To me, the argument comes down to this:

OpenAI and other companies are being valued at hundreds of billions of dollars. They'd be worth nothing if they didn't have this material. Therefore this material has value. Value they should pay for. If the value of that information is nothing, OpenAI should be worth nothing.

Their fancy space-age algorithms wouldn't be worth anything if they didn't have the intellectual property of just about all human knowledge. If it's worth billions, it's worth billions to us.

I'd feel differently about this if all the models they created were considered public domain. Whatever work OpenAI does has to be given to the public. I know Facebook's models and others are mostly available but if they got them by scraping the whole internet, they should, at least, be given back to the whole internet.

That doesn't solve all the other problems with LLMs not limited to:

  • Sounding really good until you realize they're giving you 20% BS and omitting 50% completely.
  • Filling the world with carbon.
  • Filling the internet with slop.
  • Tricking bosses into thinking they can fire people.
 


After reading the article linked in the OP, the below seems especially relevant:

Internal communications show employees saying that Meta did indeed torrent LibGen, which means that Meta could have not only accessed pirated material but also distributed it to others—well established as illegal under copyright law, regardless of what the courts determine about the use of copyrighted material to train generative AI.

So yes, Meta's downloads definitely seem illegal to me (again IANAL, TINLA) and I would think that any LLM derived from it is not a "neutral" proposition but is also infringing ("fruit of the poisonous tree") but I know better than to think what seems to be common sense to me is common sense to the courts as well.

To me, the argument comes down to this:

... They'd be worth nothing if they didn't have this material. Therefore this material has value.... Whatever work OpenAI does has to be given to the public. I know Facebook's models and others are mostly available but if they got them by scraping the whole internet, they should, at least, be given back to the whole internet.

To add to Sly's argument: ethically, it seems correct that some sort of legal equivalent of disgorgement should come into play here (if an illegal activity results in a monetary gain, the law can strip that ill-gotten gain away). If Meta engaged in illegal activity to build its LLM, and that LLM has value, the law should be entitled to strip that value away (and probably open it up to the public, since ill-gotten monies clawed back by disgorgement generally go back into the public coffers).

For better or worse, the LLM built on this massive infringement exists, it's not going to un-exist. Making the LLM itself public domain is probably the correct ethical course of action since it will be all but impossible to distribute any sort of monetary damages among all the copyright holders whose work was infringed in the torrenting (and in class action suits, generally nearly all the money goes to the lawyers, not members of the class).
 
