My objection isn't that OpenAI read a bunch of data they scanned from the internet. If all they were doing was having their AI read the contents of their browser cache as it was being displayed in the browser window, I don't have any problem with that. I don't think it's possible for a copyright holder to post public-facing internet content without implicitly giving permission for people (or entities) to create copies of that content for the express purpose of viewing those copies in internet browsers. Allowing those cached copies to exist is literally the express purpose of posting public-facing content online, so it would be nonsensical to say you're posting public-facing content online but also withholding that permission.
Agreed.
My objection would be if, as I suspect is the case, OpenAI copied the content they scraped from the internet into a training database separate from any browser cache. Saving a new copy of online content into a database separate from your browser cache is not standard practice when viewing a website, so there's no implicit permission to do so.
I don't know what they did, but if this is the basis of your objection, are you saying that if they had fed the input directly into the training program without saving a separate copy on the computer (indeed, potentially just streaming it into a service of some sort and thus never creating any permanent file of it), then in your mind the whole procedure would be legal?
Because if I had known that some judge was going to rule that way, and I was running an AI firm, I totally would have had the programmers write things that way in order to follow the strict letter of the law. But I believe your technical approach to legal rights fundamentally breaks down as non-transparent law. The law should never be such that it turns on unforeseen technicalities.
And there is also precedent for why your technical letter-of-the-law approach is flawed, which I mentioned before: internet browsers. OpenAI is far from the first company to scan the whole internet into a database and then make a derivative work of it. Google is the first company to do that, and they are still doing it. They have web crawlers that go out and read all the words, put them into a database, and use that data as the basis for building search engines. They then use that search engine to generate ad revenue. So your strict letter-of-the-law approach based on technicalities makes not only training an AI illegal, but also building the index behind every web search.
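To make the parallel concrete, here's a minimal sketch of what "crawl pages, store the words in a database, build a search index" amounts to. This is a toy inverted index, not Google's actual pipeline, and the page contents are placeholders:

```python
# Toy inverted index: the same "copy the words into a database" step that
# both search engines and AI training pipelines rely on.
# The pages dict stands in for text a crawler would have fetched and stored.
from collections import defaultdict

pages = {
    "https://example.com/a": "copyright law and fair use online",
    "https://example.com/b": "web crawlers copy content to build an index",
}

index = defaultdict(set)          # word -> set of URLs containing it
for url, text in pages.items():
    for word in text.lower().split():
        index[word].add(url)

def search(query):
    """Return URLs containing every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    return sorted(set.intersection(*(index[w] for w in words)))

print(search("copy content"))     # -> ['https://example.com/b']
```

The point being: the stored copy of everyone else's words is the product in both cases; only what gets built on top of it differs.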
And further, because it rests on a technicality, as I said, I could do all of this without storing a file at all. And heck, for all I know, they never did store a file. Maybe they just put it all into some sort of database structure immediately upon crawling the web with a custom web crawler.
The web browsers that you use to browse the web are just one way of accessing and displaying the information on the internet. You can - and I have - write automated web crawlers that involve no human viewing at all and which just download information from websites. In my case, I was downloading genetic transcriptions from NCBI for use in things like automated annotation and eventually protein folding, but you can do this to any website. I have, for example, considered for a while now writing a simple crawler (with a suitable wait period between requests) to download all my past posts at EnWorld so that I'll have a copy in the event EnWorld blows a fuse; a quick sketch of the sort of thing I mean is below.
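Something like this, roughly. The URLs and output path are placeholders, and a real EnWorld backup script would have to walk my post-history pages and respect the site's robots.txt and terms of use; this just shows how little machinery is involved:

```python
# Bare-bones polite crawler: fetch a list of pages, wait between requests,
# and save each one locally. URLs here are hypothetical examples.
import time
import pathlib
import requests

urls = [
    "https://www.enworld.org/threads/example-thread-1/",
    "https://www.enworld.org/threads/example-thread-2/",
]

out_dir = pathlib.Path("enworld_backup")
out_dir.mkdir(exist_ok=True)

for i, url in enumerate(urls):
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    (out_dir / f"page_{i:04d}.html").write_text(resp.text, encoding="utf-8")
    time.sleep(5)                 # suitable wait period between requests
```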
I think you are getting lost down a technical rabbit hole that doesn't really matter.
(And if they're skipping the browser and just scraping copyrighted material directly into a training file, I'd say that's an even more blatant copyright violation.)
So is Google also guilty of mass copyright violation, and can it be sued by every website its web crawlers have ever visited? Is every web search engine also guilty of copyright violation?