The question is whether they acquired the data set legally. As far as I know they did. They certainly wouldn't have the right to distribute that data set or sell it, but the data set itself - which I admit I haven't really given much thought to - doesn't strike me as critical to the discussion. I think we both agree they would have had a right to read the data or look at the data. I always assumed that they just picked a bunch of things on the internet to scan which they could legally read and then did so.
My objection isn't that OpenAI read a bunch of data they scanned from the internet. If all they were doing was having their AI read the contents of their browser cache as it was being displayed in the browser window, I don't have any problem with that. I don't think it's possible for a copyright holder to post public-facing internet content without implicitly giving permission for people (or entities) to create copies of that content for the express purpose of viewing those copies in internet browsers. Allowing those cached copies to exist is literally the express purpose of posting public-facing content online, so it would be nonsensical to say you're posting public-facing content online but also withholding that permission.
My objection would be if, as I suspect is the case, OpenAI copied the content they scraped from the internet into a training database separate from any browser cache. Saving a new copy of online content into a database separate from your browser cache is not standard practice when viewing a website, so there's no implicit permission to do so.
This has nothing to do with distribution or derivative works, and everything to do with reproducing a copyrighted work. You literally need permission to create a copy of a copyrighted work. Sure, that permission is implicitly granted to users of internet browsers for the purpose of reading a website on a browser. But at no point did OpenAI ever ask any copyright holder permission to make an additional, unauthorized copy of their work for an entirely different purpose, in an entirely different block of memory, unrelated in any way to reading it on a browser.
It would be interesting if the terms of service of any of those sites at the time specifically blocked the use of data for machine training, but I doubt it unless there was some blanket prohibition against using the text as the basis of scientific research. No one was thinking about those things at the time. No one was regularly saying anything in their terms of service like, "Sign here if you agree when accessing this information not to use it to train an AI."
It doesn't really matter what the terms of service on the websites were. Websites don't have to establish additional restrictions on the use of copyrighted content to prevent it from being copied to anything other than a browser cache. Copyright law already does that.
By default, you cannot legally create a copy of a copyrighted work without getting permission from the copyright holder. Permission to create a copy in your browser cache for the purpose of viewing a website is not permission to create an additional copy for some other purpose. You don't get to say, "Well, it's in my browser cache, so I'll just save a copy to my training database, too." By default, copyright law says you need to actively receive permission to make that second, non-browser-related copy of that data.
If, at any point, OpenAI copied copyrighted content stored in their browser cache and pasted it into a database (or other, non-browser-related file), I would contend they violated a copyright. My understanding of copyright law is that OpenAI isn't allowed to copy the data in their browser cache to a new location without first getting permission to do so. (And if they're skipping the browser and just scraping copyrighted material directly into a training file, I'd say that's an even more blatant copyright violation.)