California bill (AB 412) would effectively ban open-source generative AI

More like it would force companies in the state, or with connections to the state, to disclose all copyrighted material used in training or face heavy consequences.
Not just companies but EVERYONE, including the guy who put together a little model in his bedroom.

Google is incorporated in Delaware. So is Meta. DeepMind (Google's lab) is based in the UK. Almost none of their data centers are in California; the training happens in Texas, Oklahoma, Tennessee, etc.

So we will only rely on Chinese companies to open source their models from now on?
 



Gutenberg library
Many of the e-books there are very old (which is why they are free), written in a style and with vocabulary that nobody has used for decades. On top of that, being a free old e-book is no guarantee of good writing quality anyway. Language has evolved a lot, and using this data as training material without processing or filtering it does not make the AI better, only worse, because language from the last 100+ years gets wildly mixed together and the AI naturally cannot separate which language comes from which era; it blends it all. LLMs have problems with the concept of time anyway.

And speaking of which, Project Gutenberg has implemented one of the worst AI fears of striking actors
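
For what it's worth, the kind of date filtering described above is straightforward in principle. Here is a rough Python sketch that keeps only texts past a cutoff year before they go into a training corpus; the catalog.csv with "path" and "year" columns is a hypothetical stand-in, since Project Gutenberg's real metadata actually ships as RDF/CSV catalogs, so treat this as illustrative only:

# Hedged sketch: filter an e-book corpus by publication year before training.
# "catalog.csv" with "path" and "year" columns is a made-up stand-in for
# real Gutenberg metadata.
import csv

CUTOFF_YEAR = 1950  # arbitrary example cutoff to limit era mixing

def load_recent_texts(catalog_path="catalog.csv"):
    texts = []
    with open(catalog_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            year = row.get("year", "")
            if year.isdigit() and int(year) >= CUTOFF_YEAR:
                with open(row["path"], encoding="utf-8") as book:
                    texts.append(book.read())
    return texts

corpus = load_recent_texts()
print(f"kept {len(corpus)} texts for the training corpus")

Whether a cutoff like this actually improves a model is an empirical question, but it speaks directly to the era-mixing complaint.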
 

Man, this bill was obviously written by someone who knows nothing about the technology (as is par for the course in this sort of thing).

I decided to ponder a few of the bill's points via ChatGPT.
Here's a link to the conversation if anyone would like it.



I'd like to draw your attention to a few specific lines:

(d) “Generative artificial intelligence” or “GenAI” means an artificial intelligence system that can generate derived synthetic content, including text, images, video, and audio, that emulates the structure and characteristics of the system’s training data.
This line, with its phrase "derived synthetic content", seems a bit... inaccurate.
What dictates what "synthetic" content is? A picture/text/etc either exists or it does not.

As far as I'm aware, there's no middle ground between "synthetic" and "organic" when it comes to media.
And if someone did claim that, anything generated via a computer would be "synthetic" in nature.

And some input from ChatGPT:

  • The word synthetic here is philosophically and technically vague. Almost all digital content is "synthetic" in the sense of being constructed by tools.
  • There's no legal or engineering standard that distinguishes between "organic" and "synthetic" content. A digital painting made by a human in Photoshop and a generated image from an LLM are both "synthetic" under any reasonable lens.
  • This phrase seems meant to sound ominous or futuristic without actually providing clarity.


And this line:

(a) “Artificial intelligence” or “AI” means an engineered or machine-based system that varies in its level of autonomy and that can, for explicit or implicit objectives, infer from the input it receives how to generate outputs that can influence physical or virtual environments.
A "machine-based system ... that can infer from the input it receives how to generate outputs...." pretty much just sounds like ANY piece of software. Take photoshop, for example. it can "infer" how to alter an image based on "input" via how it is coded.

And ChatGPT's take on it:

  • That broad definition, "machine-based system that can infer from input how to generate output...", is so vague that it encompasses everything from Excel macros to recommendation engines.
  • This creates serious overreach. For example, would procedural generation in video games count? What about autocomplete in email?
  • The language reflects a non-technical or overly cautious legal mindset trying to cover all bases and ending up in a definitional swamp.


And here's a few more closing thoughts from ChatGPT:

You asked whether it's really well-intentioned. That's a fair and important challenge. A few things to consider:

  • Optics over effectiveness: Politicians often write tech legislation more to appear proactive than to actually solve problems. It wins headlines and appeals to concerned creatives—even if it’s unworkable.
  • Pressure from copyright lobbies: Large media orgs and artist unions are pushing hard to make AI companies liable for scraping public data. The bill may be less about fairness and more about enabling lawsuits and licensing regimes.
  • Chilling effect: Whether intentional or not, the bill advances the interests of legacy content holders by making it harder for open-source or indie developers to compete with well-funded incumbents who can afford to license or litigate.
This bill is:

  • Technically unrealistic
  • Legally questionable (due to federal copyright preemption)
  • Potentially innovation-stifling, especially for small devs
  • And conceptually confused, using vague or incorrect language that muddies rather than clarifies the regulatory goal
If it passes and is enforced, California could see an exodus of AI development or a chilling of open, collaborative research in favor of tightly controlled, well-funded corporate models.
 



It's a half-baked attempt at plugging a hole in a ship that's already at the bottom of the ocean. Pandora's box is already open. Cats do not go back into the bag if you shake your fist angrily at them. This is about optics and creating grounds for lawsuits, not about actually solving the problem.
I live in CA. And I hope it passes, and I will vote in support of it.
Yes, I read the article.
 


How exactly can you ban AI at a state level?
You really can't, but a lot of tech companies operate in CA and are forced to comply with its laws unless they want to move their HQs (which they can, but most just spend their moving money on lobbying instead). California is also often considered the starting point for tech laws: if something passes there, you will likely see it in other states.
 

Many of the e-books there are very old (which is why they are free), written in a style and with vocabulary that nobody has used for decades. On top of that, being a free old e-book is no guarantee of good writing quality anyway. Language has evolved a lot, and using this data as training material without processing or filtering it does not make the AI better, only worse, because language from the last 100+ years gets wildly mixed together and the AI naturally cannot separate which language comes from which era; it blends it all. LLMs have problems with the concept of time anyway.

And speaking of which, Project Gutenberg has implemented one of the worst AI fears of striking actors
That same argument would hold for a training dataset containing, let's say, physics textbooks and period romance novels. Language, style, and context are completely different, so if the AI is able to properly understand that, for example, 'force', 'energy', 'momentum', and 'field' mean something different in a scientific context than in everyday language, I don't see why it should not also get that today a "computer" is not somebody whose job is performing complex math calculations.
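
To put a finer point on that: contextual models give the same word a different vector depending on its surroundings, which is exactly what lets them separate the physics sense of 'force' from the everyday one. A quick, hedged sketch using the Hugging Face transformers library; bert-base-uncased and the example sentences are just illustrative choices:

# Sketch: the same token gets different contextual vectors in different contexts.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed_word(sentence, word):
    # Return the contextual vector for the first occurrence of `word`.
    inputs = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    idx = inputs.input_ids[0].tolist().index(tok.convert_tokens_to_ids(word))
    return hidden[idx]

physics  = embed_word("The net force on the object equals mass times acceleration.", "force")
everyday = embed_word("They had to force the door open.", "force")
everyday2 = embed_word("Nobody can force me to sign this.", "force")

cos = torch.nn.functional.cosine_similarity
print("physics vs everyday :", cos(physics, everyday, dim=0).item())
print("everyday vs everyday:", cos(everyday, everyday2, dim=0).item())

On a typical run, the two everyday uses should come out noticeably more similar to each other than either is to the physics use, though the exact numbers will vary.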
 

Not just companies but EVERYONE, including the guy who put together a little model in his bedroom.
I think the barrier to entry is so high already that the "little guy" is still someone worth at least seven figures. No way I'm training anything close to a usable model on my $500 five-year-old laptop; I don't even think a top-of-the-line gaming rig is anywhere close to enough. (Though maybe I could run Stable Diffusion locally on my laptop.)
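
That last part is plausible, for what it's worth. Here's a minimal sketch of running Stable Diffusion locally with the Hugging Face diffusers library; the model id and prompt are just examples (availability of hosted weights changes over time), and CPU-only generation will be slow:

# Sketch: local Stable Diffusion inference, the laptop-feasible end of GenAI.
import torch
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # example model id, not a recommendation
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
)
pipe = pipe.to(device)
pipe.enable_attention_slicing()  # trades speed for lower memory use

image = pipe("a red dragon guarding a hoard of dice", num_inference_steps=25).images[0]
image.save("dragon.png")

Inference like this is laptop territory; training a model from scratch is a different universe of cost, which is rather the point.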

And well, if someone does train models using only public-domain and donated material to create open-access models, I'm sure people will gladly help. Not a real problem unless they use petabytes of pirated material.
 
