Unbelievable Scale of AI’s Pirated-Books Problem

nor of the Meta case you are referencing. I suppose it's possible Google purchased a copy of each book they scanned for Google Books (though I have to admit I am doubtful) and Meta is torrenting all of the books they are utilizing.
Tell me you didn't read the OP without telling me you didn't read the OP. ;)

No wonder we're talking at cross-purposes. It’s literally the topic of this thread. I suggest you reread the OP. We can’t have a conversation if you don’t know what the conversation is even about.
 


In my defense, I did read the OP. I didn't initially read the article the OP linked to. ;)

Hopefully my more recent posts show I have since read the actual article.
 

Speaking of Napster, it's interesting how some have changed their tune (pun intended) about piracy. Either because they got older and wiser or it's finally hitting close to home for them.
I am reminded of Weird Al's (rather tongue-in-cheek) reputed take on Napster back when it was popular (source: Items in your collection - World of Weird Al Yankovic). He got "both sides" immediately.

Jeremy McCarthy of Fairfield, CT asks: Hey Al!!!!! What do u think about Napster? I just want to know if you approve.

I have very mixed feelings about it. On one hand, I'm concerned that the rampant downloading of my copyright-protected material over the Internet is severely eating into my album sales and having a decidedly adverse effect on my career. On the other hand, I can get all the Metallica songs I want for FREE! WOW!!!!!

The problem, as with all things, is that when bad things happen to us, it's a tragedy, but when the same bad things happen to someone else, we tend to handwave it as an unfortunate statistic. There are multiple bad things going on here and some people on this forum have literally been affected by it, some are fearful they will be affected by it, and others are handwaving it (including, perhaps, to some extent me in some of my earlier posts - mea culpa).
 


For better or worse, the LLM built on this massive infringement exists, it's not going to un-exist. Making the LLM itself public domain is probably the correct ethical course of action since it will be all but impossible to distribute any sort of monetary damages among all the copyright holders whose work was infringed in the torrenting (and in class action suits, generally nearly all the money goes to the lawyers, not members of the class).
No, actually, because that would give access to copyrighted material and place it in the public domain.

The reality is that Meta should be required to pay a fine to the authors or rights holders for every piece of material they accessed illegally.
 

100% agreed. If we're going to be using LLMs, they have to be good citizens themselves. They can't start by stealing every scrap of Intellectual Property just as a kickoff. Oh, and then keep doing it for the rest of time so they can stay current. I get that they're more useful to many people if they have unfettered access to everything, but that's just flat out theft. I think the ultimate resolution has to be an update to publication licensing agreements between the publishers and the creators so that the publishers can in turn provide access to the material (where the creators give their approval) for some licensing fee to be worked out so that the creators get their cut. And of course the publishers will take their even larger cut. Frankly, there have been enough mergers in the various media markets that it wouldn't take that many companies to sign on.

At the end of the day, the tools are useful and tools to come will be more useful and they need training data. But creators need to be respected too, including the ability to opt out and to get recompense for their work (the heck if I know how to figure out who gets paid how much, whether based on the number of prompts that used their work or who knows what). There's no reasonable definition of "fair use" that means I get to steal your work and use it to create work just like it on demand, where I'm literally using your creation as source data that gets fed into my system. No attribution, no nothing. Right now even music samples in songs get attribution and royalties or payment of some kind, but an AI creating a song is pulling from piles of "sample data" that came from real songs without attribution or payment. Not even a blanket payment for just being in the library. The data was just stolen.

Simply put, an LLM doesn't work without source data. None of these corporations have a "right" to anyone's IP. There has to be a reasonable compromise, something similar to what's done with streaming platforms/subscription services. Sadly corporations with billions on the line are anything but reasonable and so it'll be a legal tug of war to find out if creators have rights.
 

No, actually, because that would give access to copyrighted material and place it in the public domain.

The reality is that Meta should be required to pay a fine to the authors or rights holders for every piece of material they accessed illegally.
Again, an LLM itself does not contain copyrighted material. It is basically a collection of facts ABOUT the copyrighted material (mostly 'what word probably follows this other word' but still is not the copyrighted material itself). Placing the LLM into the public domain would not give access to the copyrighted material used to build it.

And while perhaps Meta should be required to pay a fine to the rights holders for every piece of material they accessed illegally, even locating all of the rights holders would be an impossible task (since even if we can identify the authors of each piece of copyrighted material they infringed upon, we have to track down each author AND if copyright was transferred, such as in the case of a work-for-hire, we have to figure out to whom the copyright was transferred including in cases where copyright may have been transferred multiple times), much less compensating them. In the ideal world, yes, this works. In the real world, where it is often difficult to locate rights-holders for even a single work, let alone identify all of them for a number of works on this scale, this is simply not feasible. Idealism is great, but in practical cases, it must bow to reality. Deleting every copy of the LLM might be ideal, but is likewise infeasible. If the data exists in more than one place (e.g., backups) someone is going to have a copy somewhere. It's impossible to put the toothpaste back in the tube.

In lieu of $impossible fines, the outcome that is at least reasonably possible would be the corporate death penalty for Meta with all of its intellectual property assets being forfeited to the public domain. Possible? Sure. Likely? No.

However, the most likely outcome I see based on my observations of how our legal system actually works in my lifetime is that Meta stalls this to death with their lawyers, pays a relative pittance to settle the suit, the class action lawyers get most of the proceeds of the settlement and individual authors can sign up to receive a couple of bucks, and no real punishment happens. It's not right. It's not ethical. It's not fair. But it is what my cynical self expects to see happen. :(
 

A random bit of actually relevant humor: there is another Bruce Baugh with a substantial bibliography. He’s a Canadian philosopher and professor. Party game for those familiar with my work at all: go to the Atlantic article in the OP, scroll down to the LibGen search tool, and put in my name. Can you distinguish his works from mine? :)
 


That said, for these multi-billion dollar megacorps, I'm sure if they simply pivoted to paying for one copy of each thing they scrape it wouldn't badly hit their bottom lines. The article says millions of books, so tens of millions of dollars, maybe, but that's peanuts to them.

The question is more one of licensing the content, not purchasing a copy of it.
I 100% realize my anecdote isn't the point of what you wrote, but it made me think of something to add to this. Years ago, a friend of mine received a notification from his ISP that they had provided his contact information to Paramount at the request of Paramount's legal department. A few months went by, and then he received a letter in the mail from Paramount's legal team regarding a movie of theirs they had determined he downloaded illegally. Checking his media server, he confirmed he did have the movie in question. The letter offered him two options: pay a legal fee of a few thousand dollars and they'd go away, or potentially end up being taken to court by them and pay much more. He consulted a lawyer and was advised to settle if he didn't want to risk things going further, so he did. He hasn't heard anything since and also wisely stopped torrenting media.

So while this didn't reach the courts to determine penalties against him, it ended up costing him more than the $15 or so the movie would have cost him to buy. It would be interesting to see everyone who created something Meta torrented taking them to court for damages and seeing how that shakes out in court.
 
