I don't know what they did, but if this is the basis of your objection, are you saying that if they had fed the input directly into the training program without making two copies of it on the computer (indeed, potentially just feeding it into a service of some sort and thus never making any permanent file of it), then in your mind the whole procedure would be legal?
I don't know what you mean by feeding input directly into a training program without creating a copy of it. I mean, I understand that input can be parsed and processed a few bits at a time without ever storing a full copy of the data, but sequentially parsing and storing every bit of data from a file in a buffer is functionally equivalent to copying the file. If you hire a hundred people to photocopy one page each of a hundred page book, you've effectively copied the entire book, even if those pages are never assembled all at once in the same location. The actual act of copying the book has effectively occurred in full.
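To make that concrete, here is a rough Python sketch of what "parsing a few bits at a time without storing a full copy" looks like in practice. The file name is just a placeholder; the point is that every byte of the source still passes through the consumer, which can trivially reassemble the whole file:

```python
# A rough sketch: streaming a file a chunk at a time into a consumer,
# without ever holding the whole file in one read.
def stream_file(path: str, consume, chunk_size: int = 4096) -> None:
    """Read a file in small chunks and hand each chunk to a consumer."""
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            consume(chunk)

# The "consumer" here just appends chunks to a buffer. Joined together,
# the buffer is byte-for-byte identical to the original file.
buffer = []
stream_file("page.html", buffer.append)  # hypothetical file name
copy_of_file = b"".join(buffer)  # a complete copy, never read in one go
```

Whether the chunks are ever joined back together is an implementation detail; by the time the consumer has seen every byte, the copying has effectively already happened.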
Because if I had known that some judge was going to rule that way and I was running an AI firm, I totally would have had the programmers write things that way in order to follow the strict letter of the law. But I believe your technical approach to legal rights fundamentally breaks down because it makes the law non-transparent. The law should never hinge on unforeseen technicalities.
I'm not a lawyer, so I can't really have a formal "approach to the law." That being said, I don't think my philosophy regarding copyright law is particularly technical. I'm just considering the spirit of the law: If you hold a copyright on certain content, you basically have the exclusive, transferable right to profit from that content. That's all. If you give someone permission to read a copy of that content, they can create a copy of it for that permitted purpose if creating a copy is the only possible way to read it (as it is when reading a website). If you don't give someone permission to use a copy of that content for some other (non-Fair-Use) purpose, they don't get to use it for that other purpose.
A copyright isn't an opt-in right, where the copyright holder has to explicitly enumerate every possible process which might copy or distribute their content in order to prohibit others from using their content in that way. Copyright is an opt-out right, where every possible (non-Fair-Use) process which might copy or distribute the copyright holder's content is prohibited unless the copyright holder has given their express permission for their content to be copied or distributed in that way.
And there is also a precedent, which I mentioned before, for why your technical letter-of-the-law approach is flawed: web search engines. OpenAI is far from the first company to scan the whole internet into a database and then make a derivative work of it. Google has been doing exactly that for decades, and is still doing it. Its web crawlers go out and read all the words, put them into a database, and use that data as the basis of a search engine. Google then uses that search engine as the basis for ad revenue. So your strict letter-of-the-law approach based on technicalities makes not only training an AI illegal, but also building a search index for web search.
I don't agree that a search engine and an AI training set are legally or morally equivalent in any way. When someone posts public-facing content online, they are implicitly giving permission for internet users to find and read that content via the internet. The express purpose of internet browsers and search engines is to enable internet users to find and read internet content. Those two technologies are making public-facing content available in the manner copyright holders intended, without doing anything further with that content.
AI training sets do absolutely nothing to help internet users find and read any copyrighted content used in the creation of those training sets. If those training sets copy or distribute copyrighted material in any way that isn't expressly Fair Use, they're violating the content creator's rights. The creators intended for their content to be available to read on the internet, with all of the necessary permission implied by that intent. They did not, at any point, give anyone permission to copy or distribute their work in any way aside from merely making it available to read online.
And further, since your objection rests on a technicality, as I said, I could just do this without storing a file at all. And heck, for all I know, they never stored a file. Maybe they put it all into some sort of database structure immediately upon crawling the web with a custom web crawler.
The web browsers that you use to browse the web are just one way of accessing and displaying the information on the internet. You can (and I have) write automated web crawlers that involve no human viewing at all and which just download information from websites. In my case, I was downloading genetic transcriptions from NCBI for use in things like automated annotation and eventually protein folding, but you can do this to any website. For example, I have for a while now considered writing a simple crawler (with a suitable wait period between requests) to download all my past posts at EnWorld, so that I'll have a copy in the event EnWorld blows a fuse.
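In case it's not obvious how simple that would be, here's a minimal Python sketch of the sort of crawler I have in mind. The URL pattern and page count are placeholders (not EnWorld's actual layout), and the wait period keeps it polite:

```python
# A minimal sketch of a polite archival crawler with a wait period
# between requests. The endpoint and paging scheme are hypothetical.
import time

import requests

BASE_URL = "https://example.com/user/posts"  # placeholder, not a real endpoint
WAIT_SECONDS = 5  # be polite: pause between requests


def archive_posts(page_count: int) -> None:
    """Fetch each page of posts and save it to disk as a personal backup."""
    for page in range(1, page_count + 1):
        response = requests.get(BASE_URL, params={"page": page}, timeout=30)
        response.raise_for_status()
        with open(f"posts_page_{page:04d}.html", "w", encoding="utf-8") as f:
            f.write(response.text)
        time.sleep(WAIT_SECONDS)  # wait before the next request


if __name__ == "__main__":
    archive_posts(page_count=10)
```

No human ever views anything in that loop; it's just a program downloading information from a website, which is exactly my point.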
Depending upon how they're used, I would say web crawlers may or may not be violating copyright laws. The NCBI is a government organization, so by my understanding of U.S. copyright law, the content it creates during the course of fulfilling its government function isn't protected by copyrights (and even if it were, non-commercial, academic use of its content is Fair Use).
Also, archiving a website to preserve its content in case of data loss is a long-established Fair Use case, so a web crawler isn't violating any copyrights by, for example, creating a backup copy of ENWorld for the express purpose of data preservation.
On the other hand, if I use a web crawler to find and download all bootleg copies of Disney films posted anywhere on the web because I want free copies of Disney films on my computer, I don't see any way that's not violating copyright law. Ditto if I use my web crawler to find and download all copyrighted images posted anywhere on the web because I want to use those images for some non-archival purpose (e.g., training an AI).
So is Google also guilty of a mass copyright violation, and can it be sued by every website it's ever crawled with its own web crawlers? Is every web search engine also guilty of copyright violation?
If Google starts using its web crawlers to do things which violate copyright law, then yes, I think every website creator should sue Google into the ground in a massive class-action lawsuit and win. (The EU courts might even let something like that happen, given their track record with tech companies.)
As I noted above, though, I don't see how search engines violate anyone's copyright. They are specifically enabling the permitted use of the copyrighted content in the manner the copyright holder intended for it to be used.
I think you are getting lost down a technical rabbit hole that doesn't really matter.
Since it seems you don't think my position in this debate matters, I guess we don't have anything further to discuss. I've given my two cents, so I'll bow out and give you the last word. Cheers.