it's been proven over and over that it's not stealing. Training is not stealing.
Generative AI learn EXACTLY like people do. They see images and it changes their neural network. They aren't saving images anywhere. They don't copy the pixels. They see it and it changes their brain. exactly like humans do.
well ... no.
Dealing with the second point first: the current mainstream view is that people store information as concepts -- we relate concepts to each other, and that is how we build knowledge. GenAI very explicitly does not have concepts -- it deals entirely with expressions. So when you read Lord of the Rings for the first time (you lucky thing!), your mind is changed by creating a new concept -- for example, an Ent -- which your mind links to other concepts: trees, Tolkien, fantasy, that story you read in 3rd grade -- a whole host of linkages. An LLM does not do that; it remembers only links between words. So it remembers all the words Tolkien used in passages about Ents, and those words are linked to other words.
This is a crucial difference, because it explains why LLMs are more likely to violate copyright than you are. In fact, there is a good argument that what LLMs store is actually a lossy compression of the material they train on. They store relationships between words, and so when they produce text, they tend to reproduce the material they were trained on.
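To see why storing only word-to-word links amounts to lossy compression, here is a deliberately crude toy sketch (nothing like a real transformer): a trigram "model" that records only which word follows each pair of words. Seed it with the opening words of its training text and greedy decoding replays the passage verbatim. The function names and the ten-word training snippet are just illustrative choices.

```python
# Toy sketch: a trigram table is a lossy compression of its training text.
# It stores no pixels, no file -- only word-to-word links -- yet it can
# replay the training passage verbatim from a short seed prompt.
from collections import Counter, defaultdict

training_text = "in a hole in the ground there lived a hobbit"
words = training_text.split()

# "Training": count which word follows each two-word context.
follows = defaultdict(Counter)
for a, b, c in zip(words, words[1:], words[2:]):
    follows[(a, b)][c] += 1

def generate(seed, max_words=20):
    """Greedy decoding: always pick the most common next word."""
    out = seed.split()
    while len(out) < max_words:
        nxt = follows.get((out[-2], out[-1]))
        if not nxt:
            break
        out.append(nxt.most_common(1)[0][0])
    return " ".join(out)

print(generate("in a"))
# -> "in a hole in the ground there lived a hobbit"
```

Real LLMs are vastly larger and use soft, contextual statistics rather than a hard lookup table, but the memorization mechanism reported in the paper below is the same in spirit: the word-link statistics encode the training passages themselves.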
Here's a relevant article:
Extracting memorized pieces of (copyrighted) books from open-weight language models, whose abstract reads:
... Through thousands of experiments, we show that the extent of memorization varies both by model and by book. With respect to our specific extraction methodology, we find that most LLMs do not memorize most books -- either in whole or in part. However, we also find that Llama 3.1 70B entirely memorizes some books, like the first Harry Potter book and 1984. In fact, the first Harry Potter is so memorized that, using a seed prompt consisting of just the first few tokens of the first chapter, we can deterministically generate the entire book near-verbatim.
Llama 3.1 is an older model, and most modern LLMs have specific safeguards that prevent this from happening. However, those safeguards are post-training add-ons that stop the LLM from doing what it has been trained to do, rather than making it fundamentally different.
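To make the "post-training add-on" point concrete, a safeguard of this kind can be pictured as a filter that sits outside the model: the weights still encode the memorized text, and the wrapper simply refuses to hand over long verbatim matches. This is a hypothetical sketch of the general idea, not any vendor's actual implementation; the function names, the 8-word threshold, and the tiny "protected corpus" are all invented for illustration.

```python
# Hypothetical post-hoc copyright guard: the model is untouched; a wrapper
# blocks outputs that reproduce long verbatim runs from protected text.
def longest_common_run(gen_words, prot_words):
    """Length of the longest verbatim run of words shared by the two texts."""
    best = 0
    for i in range(len(gen_words)):
        for j in range(len(prot_words)):
            k = 0
            while (i + k < len(gen_words) and j + k < len(prot_words)
                   and gen_words[i + k] == prot_words[j + k]):
                k += 1
            best = max(best, k)
    return best

def guarded_output(generated, protected_corpus, max_run=8):
    """Refuse any output that copies more than max_run consecutive words."""
    for protected in protected_corpus:
        if longest_common_run(generated.split(), protected.split()) > max_run:
            return "[blocked: output matches protected text]"
    return generated

corpus = ["in a hole in the ground there lived a hobbit not a nasty dirty wet hole"]
print(guarded_output("the hobbit lived in a hole", corpus))   # short overlap passes
print(guarded_output("in a hole in the ground there lived a hobbit "
                     "not a nasty dirty wet hole filled with worms", corpus))
# -> [blocked: output matches protected text]
```

The point of the sketch is that nothing here changes what the model knows; remove the wrapper (or jailbreak around it) and the memorized text is still there to be extracted.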
TLDR: Humans train via concepts; LLMs via words. Which is why we are terrible at memorizing words and LLMs are terrible at conceptual thought.
On your first point: this has not been established in the courts as a general rule. I know some courts have held that if you have the rights to a work, then training on that work creates a derivative work that is not subject to copyright claims. However, those decisions did not have the evidence we now have that LLMs can reproduce copyrighted material near-verbatim. There are also plenty of other legal opinions. For example, the legal professionals I have worked with do not believe that anyone has the right to train LLMs on Protected Health Information and then use the trained model outside of care for those specific patients. Some vendors have legal experts who disagree, and believe that if you use the LLM both for those patients AND for others, it's OK. I admit we are at the conservative end of this debate (and I am happy to be there), but we do have to admit this is not a cut-and-dried question.
To be clear, if you believe that no form of training is stealing, you are saying it is OK to train LLMs on your personal data -- financial, medical, and otherwise -- knowing there is a good chance the LLM can reproduce it on demand for any use by any user. I think most people would prefer that not be the case.