Maggan
Writer for CY_BORG, Forbidden Lands and Dragonbane
The Washington Post has published an article about which sites Google is scraping content from to train its AI.
According to the WaPo, 15 million websites were used:
"To look inside this black box, we analyzed Google’s C4 data set, a massive snapshot of the contents of 15 million websites that have been used to instruct some high-profile English-language AIs, called large language models, including Google’s T5 and Facebook’s LLaMA."
The WaPo then analyzed the set of sites and concluded that 10 million of them could be categorized and used for further analysis:
"We then ranked the remaining 10 million websites based on how many “tokens” appeared from each in the data set. Tokens are small bits of text used to process disorganized information — typically a word or phrase."
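For the curious, here's roughly what that ranking boils down to in code. A minimal sketch, assuming the Hugging Face datasets library and its public allenai/c4 mirror; the whitespace split is a stand-in for whatever tokenizer the WaPo actually used:

```python
# Sketch of the WaPo-style analysis: count tokens per website in a
# small sample of Google's C4 data set, then rank the sites.
# Assumes the Hugging Face "datasets" library and the "allenai/c4"
# mirror; splitting on whitespace is an illustrative simplification.
from collections import Counter
from urllib.parse import urlparse

from datasets import load_dataset

# Stream a slice of C4 rather than downloading the full data set.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

token_counts = Counter()
for i, record in enumerate(c4):
    if i >= 10_000:  # look at the first 10k documents only
        break
    domain = urlparse(record["url"]).netloc
    token_counts[domain] += len(record["text"].split())

# Rank sites by how many tokens they contributed to the sample.
for rank, (domain, tokens) in enumerate(token_counts.most_common(20), 1):
    print(f"{rank:>3}. {domain}: {tokens:,} tokens")
```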
And here's the interesting bit. There's a tool that shows whether a specific site has been used and how much of its data was gathered. And enworld.org sits at rank 11,248 out of 10 million.
So watch your language, fellow ENworlders, someone out there is listening and training our future overlords, partially based on what we write here ...
