Unbelievable Scale of AI’s Pirated-Books Problem

The problem is the unreliability. You can't let anything with a failure rate this high anywhere near anything where it could do damage. That there's a bill to allow AIs to prescribe medications absolutely blows my mind.
 


There are uses for current types of "AI" in working with very large data sets for analysis that are useful in both business and scientific applications. LLMs and what we have now generating art and video are significantly limited by their lack of understanding of context, which means the hallucinations we've all seen are not a fixable thing without completely changing how the system works.
And that's the point that I'm making. You're generalizing that all AI and all GenAI will disappear based on a use case that isn't the one many businesses actually use it for. It's perfectly useful for other applications in business that give a bigger lift than creating a video or an image.

If you'd said something like the use of GenAI to wholesale create art or video might be cooling off soon because of limitations, that would be a much smaller claim than saying all GenAI will reach a cooling-off period, and it would be one that I could possibly see.

But just the amount of time it saves in a lot of use cases is a reason that it's nowhere near a cooling-off period in business applications, and that's just touching the tip of the iceberg.
 

What loss if I bought the album anyway?

Well, the loss of selling you the single and the album, instead of just one of them. And, yes, you can say, "It is just one copy of one song". But that's looking at one tree, and ignoring the forest of thousands and thousands of people willing to say the same thing.

In the end, it is a rationalization, and we don't actually have the right to impose our rationalization on the creator. The term is copy*right*: they have the right to choose; we do not have the right to choose for them.
 


This is going to be an unpopular opinion:

Training AI on existing works isn't piracy by any reasonable definition of the term. It isn't necessarily ethical, but it isn't piracy. By intentionally using an incongruous term and trying to shoehorn it into your argument, you actually weaken your argument.

More simply: if I can't ask ChatGPT to replicate the PHB, it isn't piracy.
It is theft if they used the content and did not pay for it.

I am mildly amused at all the open access advocates who screamed for years about big publishers now realizing that giant billionaire AI companies just took all that work to train their models for free and make more money.

In STM publishing, authors pay to make their work OA, and now companies get to train AI on it and monetize it while the authors get nothing.
 

So it's interesting. I've been reading a LOT of medical articles of late for reasons.

With many of them, you don't get the full text, which it seems is what LibGen was intended for (or wants to claim): offering up information that could advance things that are quite critical to people's lives.

Novels, game books, and the like are things people need to get paid for so they can make a living.

I'm pretty sure these things are not equal, but I'm also sure that it doesn't matter to anyone but the people ripping off, or getting ripped off.
Medical articles are critical for a lot of people to get paid too. Editorial assistants, managing editors… peer review and copyediting are not free.
 


I don't care what pedantic arguments might be made, or about the longstanding historical precedent of exploitation; it doesn't sit right with me to use someone else's labor without their permission to make yourself wealthy while they stay poor. Just as an overall big-picture stance.

And yes, I am fully aware of the AI arms race and the dangers and risks involved in not pushing AI really hard (for the country that "wins" the AI arms race, it will be like going to war against an 18th-century nation with today's technology).
 

Misconception: LLMs store a copy of all data they were trained on, and therefore are prima facie breaking copyright if they do so without explicit permission.

The Reality: LLMs do pattern analysis on the training data they are given. What winds up getting stored in the model is a list of the words and "word-like concepts" (such as punctuation, "end of sentence", "end of paragraph", "end of document", etc.) it has seen in the training data, plus a bunch of mathematical weights that basically reflect "how often did I see this word in close proximity to that word?" For example, the entry "purple" might have a high mathematical weight for "how often do I see this with 'people'" and "how often do I see this with 'eater'" and a low weight for "how often do I see this with 'deoxyribonucleic'".
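(To make that concrete, here's a deliberately over-simplified sketch in Python of the kind of "word proximity statistics" I'm describing. A real LLM learns dense neural-network weights over subword tokens rather than an explicit count table, so treat this as an illustration of the intuition, not of the actual machinery; the corpus and names are made up.)

```python
from collections import Counter, defaultdict

# Toy illustration: tally how often each word is followed by each other word
# in some training text. Real LLMs learn neural weights rather than a count
# table, but the "statistics about which words appear near which" idea is
# the same.
def build_bigram_counts(text: str) -> dict[str, Counter]:
    words = text.lower().split()
    counts: dict[str, Counter] = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        counts[current][nxt] += 1
    return counts

corpus = "the purple people eater ate the purple people"
model = build_bigram_counts(corpus)
print(model["purple"].most_common())  # [('people', 2)] -- and never 'deoxyribonucleic'
```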

So an LLM stores a statistical survey of how often words appear in proximity to each other in whatever training data it was fed. The LLM simply catalogues facts about the work rather than storing the work itself. The US Supreme Court has held that facts are not created, they are discovered. This means that even if creative ingenuity was required to bring a work into existence in the first place, a fact ABOUT the work does not require creative ingenuity (for example, I don't think anyone would argue that giving a book's word count constitutes copyright infringement on the book itself; or, for an even more specific example, the word "Remember" occurs 121 times in the King James Bible - my statement of fact about a work is not considered infringing).

Of course, if you understand the process by which data was encoded INTO an LLM, you can set up a way to somewhat "reverse engineer" the process to bring something that looks like creative writing out. The simplest way to do this would be to say "start with the word 'The' and then return the most probable next word. Keep iterating until the most probable next word is 'end of document'." Of course, doing this particular thing would get you the same result (assuming the same LLM) every time, so a little bit of randomness is baked into most Generative AI prompts (instead of "next most probable word" you might say "at each step roll a d6 and give me the nth next most probable word").
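(Again as an over-simplified illustration: here's roughly what that "most probable next word, with a die roll mixed in" loop might look like. The probability table below is entirely invented for the example, not extracted from any real model.)

```python
import random

# Made-up next-word table, sorted by probability. Greedy decoding (top_k=1)
# always takes the single most probable continuation, so its output never
# changes; top_k > 1 is the "roll a die and take the nth most probable word"
# variant described above.
NEXT_WORD = {
    "the":    [("purple", 0.5), ("people", 0.3), ("eater", 0.2)],
    "purple": [("people", 0.9), ("prose", 0.1)],
    "people": [("eater", 0.7), ("the", 0.3)],
    "eater":  [("<end>", 0.6), ("the", 0.4)],
}

def generate(start: str = "the", top_k: int = 1, max_words: int = 10) -> str:
    words = [start]
    while words[-1] in NEXT_WORD and len(words) < max_words:
        candidates = NEXT_WORD[words[-1]][:top_k]   # top_k=1 is pure greedy decoding
        nxt = random.choice(candidates)[0]          # top_k>1 adds the randomness
        if nxt == "<end>":
            break
        words.append(nxt)
    return " ".join(words)

print(generate(top_k=1))  # deterministic: "the purple people eater" every time
print(generate(top_k=2))  # varies from run to run because of the random choice
```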

More to the point, assuming you trained on more than one work, it is almost impossible to get any one specific work out, start to finish (though you might get "snippets" of a book in its commonly-used phrases... which I suppose isn't that much different than human writers relying on commonly-used phrases), as the more data you put into it, the less any specific work will be able to dominate the probability dataset.
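(To make that last point concrete with the same toy counting approach - all numbers here are invented for illustration - imagine the counts for what follows the phrase "call me" in one particular book versus in a large pooled corpus. Once lots of other text is mixed in, that book's distinctive continuation stops being the most probable one.)

```python
from collections import Counter

# Invented counts: in one book, "call me" is always followed by "ishmael";
# pooled with counts from everything else, that phrasing no longer dominates.
one_book   = Counter({"ishmael": 1})
everything = Counter({"back": 4821, "later": 3990, "now": 2544, "ishmael": 1})

pooled = one_book + everything
print(one_book.most_common(1))   # [('ishmael', 1)] -- the single work dominates
print(pooled.most_common(1))     # [('back', 4821)] -- diluted across the corpus
```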

AI/LLM is not good at creating new work. Rather, it's an algorithm that predicts an average response, as if you had "asked" all of its training data. If its training data is skewed or biased, the AI response will be skewed or biased. But since content generated by AI has no human component to it - there's nothing sentient trying to express an idea - AI work is generally held to be uncopyrightable. It's an expression of facts that "looks like" it has intelligence behind it, but really it's just a fancy way to generate a statistically average response to a prompt based on general language patterns.

I should also point out that as a human creative, I have read tens (hundreds?) of thousands of books in my lifetime, and millions of words... and in my own head have encoded something about how the English language works, similar to what an LLM encodes, in order to express my thoughts using the English language as I understand it. Just because 30 years ago I read "insert book here", does that make this post a derivative of that book because I'm using the model of English that the book helped me create in my head to help me pick the words that are going into this post? Am I committing copyright infringement now? There are only so many combinations of words and letters to express an idea, and many of them (especially, say, prepositions) are very commonly used! I don't believe I am infringing on someone else's use of the word "the", and I don't think most people would argue I was either. I think generally we call something "infringement" when we can directly connect a significant portion of a work's style or tone to another specific work.

(There is a similar issue going on in the music world, where musicians are arguing over whether the use of common chord progressions or rhythms is infringing. It's only more obvious in the music world because there are fewer permutations of notes and rhythms in music than there are in the English language, but the problem is the same... how easy is it for two people grounded in the same general musical tradition or the same language to independently come up with similar songs or sentences, and if they came up with them independently, is it infringing?)

We've been dealing with the copyright question for hundreds of years without a satisfactory answer. The concept of copyright incorporates the idea that a novel thought or idea is being created. Just because AI-generated text (or art, or music, or whatever) looks very similar to human-generated work to our human sensibilities does not mean a novel thought or idea is being created! AI creates nothing new, but I don't think that, in the strictest sense, it does what it does by infringement. It does it by analyzing many inputs and creating a model for how an idea is expressed in words (or pixels or soundwaves or whatever). The "creative" point in the AI process is the inputting of a prompt (done by a human), and the output generated is based entirely on uncopyrightable facts about copyrighted materials.

Those whose livelihoods are dependent on being paid for their efforts to use the current tools (pen and paper, typewriters, computers, cameras, paint and canvas, etc.) are, IMO, misplacing their frustration at AI. It is not the "creative process" that is being replaced by AI! Instead, it is the secondary process by which they encode some physical representation of the result of their creative process into a fixed medium which is being replaced (and that's where almost all of the work currently is)! As the saying goes, "ideas are cheap" - I would suggest for most artists, the original idea for a story or painting or film comes quickly, but the work is in taking that vision out of one's head and putting it into words (or canvas or film or whatever). They have taken time to hone their talents in the "Translation" process, and that's the one that's being automated. For some it brings joy, for others, it's work, but at the end of the day, since most of the work has classically been at this step, that's been the work that brings them revenue.

Of course, I'm sure a lot of accountants that used to do adding and subtracting by hand were pretty upset when computer spreadsheets became a thing. Some still prefer to do it by hand, but let's be honest, a spreadsheet does it a lot faster, and for most of us, math is tedious work and we're grateful not to have to do it ourselves... we're glad to see it become something we can get a computer to do for us (most fast-food workers 50 years ago could calculate change in their head but today, the change they give you is "whatever the register says" because they don't want to do the math). Technology makes certain skills less valuable... it's a sad state of affairs for those whose skills are being obsoleted, but it is something of an equalizer for those of us that don't have those skills! I know I much prefer to let technology weave my clothes so my fabrics are more uniform than they were when everything was done by hand (and especially so I don't have to do it myself)!

The current hot saying in tech circles is, "you aren't going to be replaced by AI, you will be replaced by someone who knows how to use AI." Essentially, AI tools are a force multiplier to go from "idea" to "fixed medium." Does the creative process happen in the forming of the idea or in the fixing of the idea to a medium? I would venture to say the creative process happens in the forming, not the fixing. Is the "humanity" and "creativity" in art the mental labor of creating the idea or the physical and mental labor of translating the idea into a fixed medium? I would dare say that if a machine can be employed to fix an idea, but not to originate the idea to be fixed, the "humanity" and "creativity" in the work is in the origination of the idea, not the fixing of it. Of course, the problem is that copyright is all about protecting the labor required to "fix an idea in a medium" and not the idea itself (facts are uncopyrightable, ideas are uncopyrightable... only a particular expression of an idea is copyrightable).

I tend to agree with the sentiment of "I wanted AI to do my laundry and wash my dishes so that I could create art and write books! I didn't want AI to create art and write books so I could do my laundry and wash my dishes!" In other words, I want tools to do the tasks I personally find tedious so I can focus on the stuff I find personally fulfilling! I understand that for many artists, the process of fixing their art into a medium IS the "personally fulfilling" part, and for some things I agree. But for a very specific example, I am horrible at painting. I find it tedious, and I'm not physically that adept at blending colors and subtle brush strokes and light and shadows, so for me, using something like Midjourney to create a beautiful painting to express an idea I have (say, creating a portrait of an NPC) eliminates tedium and lets me get on to things I like to do (imagining NPCs). I'm not a fan of George R.R. Martin, but many people in this community seem to be, and I think the biggest frustration with him has been that he has not been able to put in the work to complete his A Song of Ice and Fire series. I suspect he finds the work both enjoyable AND tedious (the tedium is why he hasn't finished it yet), and if there were a way for him to give a sufficiently specific prompt to an AI to spit the rest of the series out in a day, provided the resulting story matched his vision and style, most of his fans would want that.

So while I do have some trepidation about where AI is going and the disruptions it is making, I also understand that my own discomfort with it mostly stems from the fact that, traditionally, technological disruptions have often made making a living miserable for those whose jobs they replaced, even if, on a generational scale, nobody "missed" those jobs several generations later... and I know they have mostly been used in the short term to enrich the ruling class rather than spread general prosperity. On the other hand, I am happy to hope that over time, many of the things I find tedious today can be eliminated (though of course the things I currently take joy in I'll probably then find tedious).

All of this is a very long way of saying, "I'm not in a hurry for AI art to take over human-created art, but I am excited that AI art might lower the barrier for entry so that more ideas can be quickly translated into shareable media." I'll still value the hand-crafted, artisanal stuff in the same way I value the crocheted blankets my grandmother hand-made for me, and I'll keep those around to treasure their craftsmanship, but when I need a new, warm, blanket, I'm more likely to go get some artificial fleece blanket since those are bigger, warmer, and cheaper even if they are less sentimental... because I'm looking for warmth in that blanket, not sentimentality.

(This post is probably going to make me somewhat unpopular, but I think this is a very complex issue and there's not a simple "AI good" or "AI bad" view that can easily be justified from all sides - of course, anyone whose labor is being threatened by AI is very justified in saying "AI bad", and anyone profiting off AI is going to say "AI good", but for those of us for whom AI has the prospect of improving some aspects of our lives and hurting others, there's a lot of grey there.)
 


