EN World ranked 11,248 out of 15 million sites used to train AI

Maggan

Writer for CY_BORG, Forbidden Lands and Dragonbane
The Washington Post has published an article about which sites Google is scraping content from to train its AI.


According to the WaPo, 15 million websites are used.

To look inside this black box, we analyzed Google’s C4 data set, a massive snapshot of the contents of 15 million websites that have been used to instruct some high-profile English-language AIs, called large language models, including Google’s T5 and Facebook’s LLaMA.

The WaPo then analyzed the set of sites and concluded that 10 million of them could be categorized and used for further analysis.

We then ranked the remaining 10 million websites based on how many “tokens” appeared from each in the data set. Tokens are small bits of text used to process disorganized information — typically a word or phrase.

And here's the interesting bit. There's a tool that shows if a specific site has been used and how much data has been gathered. And enworld.org sits at 11,248 of 10 million.

So watch your language, fellow ENworlders, someone out there is listening and training our future overlords, partially based on what we write here ... :cool:
 


Dioltach

Legend
Well, the Rise of the Machines suddenly seems more interesting. "Foolish humans, prepare to die! Make a Reflex save and then roll for initiative!"

(Actually, I always say the Rise of the Machines started a long time ago. But instead of Terminators, they're killing us through high blood pressure: every time the wi-fi goes down, every time the printer doesn't want to print, every time the battery on your phone suddenly goes from 52% to 12% in the space of a minute, every time your computer randomly changes the display settings...)
 

Dannyalcatraz

Schmoderator
Staff member
Supporter
Good to know! Now we won’t be surprised if Skynet wants to play D&D instead of Global Thermonuclear War.

 







