California bill (AB 412) would effectively ban open-source generative AI

More like it would force companies in the state, or with connections to the state, to disclose all copyrighted material used in training or face heavy consequences.
Not just companies but EVERYONE, including the guy who put together a little model in his bedroom.

Google is incorporated in Delaware. So is Meta. DeepMind (Google's lab) is based in the UK. Almost none of their data centers are in California; the training happens in Texas, Oklahoma, Tennessee, etc.

So we will only rely on Chinese companies to open source their models from now on?
 



Gutenberg library
Many of the e-books there are very old (which is why they are free), written in a style and with vocabulary that nobody has used for decades. On top of that, being a free old e-book is no guarantee of good writing quality anyway. Language has evolved a lot, and using this data as training material without processing or filtering it does not make the AI better, only worse, because language from the last 100+ years gets wildly mixed together and the AI naturally cannot separate which language comes from which era; it blends it all. LLMs have problems with the concept of time anyway.

And speaking of which, Project Gutenberg has implemented one of the worst AI fears of striking actors
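
For what it's worth, the kind of date filtering described above is straightforward in principle. Here is a rough Python sketch that keeps only texts past a cutoff year before they go into a training corpus; the catalog.csv with "path" and "year" columns is a hypothetical stand-in, since Project Gutenberg's real metadata actually ships as RDF/CSV catalogs, so treat this as illustrative only:

# Hedged sketch: filter an e-book corpus by publication year before training.
# "catalog.csv" with "path" and "year" columns is a made-up stand-in for
# real Gutenberg metadata.
import csv

CUTOFF_YEAR = 1950  # arbitrary example cutoff to limit era mixing

def load_recent_texts(catalog_path="catalog.csv"):
    texts = []
    with open(catalog_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            year = row.get("year", "")
            if year.isdigit() and int(year) >= CUTOFF_YEAR:
                with open(row["path"], encoding="utf-8") as book:
                    texts.append(book.read())
    return texts

corpus = load_recent_texts()
print(f"kept {len(corpus)} texts for the training corpus")

Whether a cutoff like this actually improves a model is an empirical question, but it speaks directly to the era-mixing complaint.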
 

Man, this bill was obviously written by someone who knows nothing about the technology (as is par for the course in this sort of thing).

I decided to ponder a few of the bill's points via ChatGPT.
Here's a link to the conversation if anyone would like it.



I'd like to draw your attention to a few specific lines:

(d) “Generative artificial intelligence” or “GenAI” means an artificial intelligence system that can generate derived synthetic content, including text, images, video, and audio, that emulates the structure and characteristics of the system’s training data.
This line, with its phrase "derived synthetic content", seems a bit... inaccurate.
What dictates what "synthetic" content is? A picture/text/etc either exists or it does not.

As far as I'm aware, there's no middle ground between "synthetic" and "organic" when it comes to media.
And if someone did claim that, anything generated via a computer would be "synthetic" in nature.

And some input from ChatGPT:

  • The word synthetic here is philosophically and technically vague. Almost all digital content is "synthetic" in the sense of being constructed by tools.
  • There's no legal or engineering standard that distinguishes between "organic" and "synthetic" content. A digital painting made by a human in Photoshop and a generated image from an LLM are both "synthetic" under any reasonable lens.
  • This phrase seems meant to sound ominous or futuristic without actually providing clarity.


And this line:

(a) “Artificial intelligence” or “AI” means an engineered or machine-based system that varies in its level of autonomy and that can, for explicit or implicit objectives, infer from the input it receives how to generate outputs that can influence physical or virtual environments.
A "machine-based system ... that can infer from the input it receives how to generate outputs...." pretty much just sounds like ANY piece of software. Take photoshop, for example. it can "infer" how to alter an image based on "input" via how it is coded.

And ChatGPT's take on it:

  • That broad definition, "machine-based system that can infer from input how to generate output...", is so vague that it encompasses everything from Excel macros to recommendation engines.
  • This creates serious overreach. For example, would procedural generation in video games count? What about autocomplete in email?
  • The language reflects a non-technical or overly cautious legal mindset trying to cover all bases and ending up in a definitional swamp.


And here's a few more closing thoughts from ChatGPT:

You asked whether it's really well-intentioned. That's a fair and important challenge. A few things to consider:

  • Optics over effectiveness: Politicians often write tech legislation more to appear proactive than to actually solve problems. It wins headlines and appeals to concerned creatives—even if it’s unworkable.
  • Pressure from copyright lobbies: Large media orgs and artist unions are pushing hard to make AI companies liable for scraping public data. The bill may be less about fairness and more about enabling lawsuits and licensing regimes.
  • Chilling effect: Whether intentional or not, the bill advances the interests of legacy content holders by making it harder for open-source or indie developers to compete with well-funded incumbents who can afford to license or litigate.
This bill is:

  • Technically unrealistic
  • Legally questionable (due to federal copyright preemption)
  • Potentially innovation-stifling, especially for small devs
  • And conceptually confused, using vague or incorrect language that muddies rather than clarifies the regulatory goal
If it passes and is enforced, California could see an exodus of AI development or a chilling of open, collaborative research in favor of tightly controlled, well-funded corporate models.
 



It's a half-baked attempt at plugging a hole in a ship that's already at the bottom of the ocean. Pandora's box is already open. Cats do not go back into the bag if you shake your fist angrily at them. This is about optics and creating grounds for lawsuits, not about actually solving the problem.
I live in CA. And I hope it passes, and I will vote in support of it.
Yes, I read the article.
 


How exactly can you ban AI at a state level?
You really can't, but a lot of tech companies operate in CA and are forced to comply with its laws unless they want to move their HQs (which they can, but most just spend their moving money on lobbying instead). California is also often considered the starting point for tech laws: if something passes there, you will likely see it in other states.
 

Many of the e-books there are very old (which is why they are free), written in a style and with vocabulary that nobody has used for decades. On top of that, being a free old e-book is no guarantee of good writing quality anyway. Language has evolved a lot, and using this data as training material without processing or filtering it does not make the AI better, only worse, because language from the last 100+ years gets wildly mixed together and the AI naturally cannot separate which language comes from which era; it blends it all. LLMs have problems with the concept of time anyway.

And speaking of which, Project Gutenberg has implemented one of the worst AI fears of striking actors
That same argument would hold for a training dataset containing, let's say, physics textbooks and period romance novels. Language, style, and context are completely different, so if the AI is able to properly understand that, for example, 'force', 'energy', 'momentum', and 'field' mean something different in a scientific context than in everyday language, I don't see why it should not also get that today a "computer" is not somebody whose job is performing complex math calculations.
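
To put a finer point on that: contextual models give the same word a different vector depending on its surroundings, which is exactly what lets them separate the physics sense of 'force' from the everyday one. A quick, hedged sketch using the Hugging Face transformers library; bert-base-uncased and the example sentences are just illustrative choices:

# Sketch: the same token gets different contextual vectors in different contexts.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed_word(sentence, word):
    # Return the contextual vector for the first occurrence of `word`.
    inputs = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    idx = inputs.input_ids[0].tolist().index(tok.convert_tokens_to_ids(word))
    return hidden[idx]

physics  = embed_word("The net force on the object equals mass times acceleration.", "force")
everyday = embed_word("They had to force the door open.", "force")
everyday2 = embed_word("Nobody can force me to sign this.", "force")

cos = torch.nn.functional.cosine_similarity
print("physics vs everyday :", cos(physics, everyday, dim=0).item())
print("everyday vs everyday:", cos(everyday, everyday2, dim=0).item())

On a typical run, the two everyday uses should come out noticeably more similar to each other than either is to the physics use, though the exact numbers will vary.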
 

Not just companies but EVERYONE, including the guy who put together a little model in his bedroom.
I think the barrier to entry is so high already that the "little guy" is still someone worth at least seven figures. No way I'm training anything close to a usable model on my $500 five-year-old laptop; I don't even think a top-of-the-line gaming rig is anywhere close to enough. (Though maybe I could run Stable Diffusion locally on my laptop.)
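
That last part is plausible, for what it's worth. Here's a minimal sketch of running Stable Diffusion locally with the Hugging Face diffusers library; the model id and prompt are just examples (availability of hosted weights changes over time), and CPU-only generation will be slow:

# Sketch: local Stable Diffusion inference, the laptop-feasible end of GenAI.
import torch
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # example model id, not a recommendation
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
)
pipe = pipe.to(device)
pipe.enable_attention_slicing()  # trades speed for lower memory use

image = pipe("a red dragon guarding a hoard of dice", num_inference_steps=25).images[0]
image.save("dragon.png")

Inference like this is laptop territory; training a model from scratch is a different universe of cost, which is rather the point.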

And well, if someone does train models using only public-domain and donated material to create open-access models, I'm sure people will gladly help. Not a real problem unless they use petabytes of pirated material.
 
