Maggan
Writer for CY_BORG, Forbidden Lands and Dragonbane
The Washington Post has published an article about which sites Google is scraping content from to train its AI.
According to the WaPo, 15 million websites were used:
"To look inside this black box, we analyzed Google’s C4 data set, a massive snapshot of the contents of 15 million websites that have been used to instruct some high-profile English-language AIs, called large language models, including Google’s T5 and Facebook’s LLaMA."
The WaPo then analyzed the set of sites and concluded that 10 million of them could be categorized and used for further analysis:
"We then ranked the remaining 10 million websites based on how many “tokens” appeared from each in the data set. Tokens are small bits of text used to process disorganized information — typically a word or phrase."
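For the curious, here's roughly what that ranking boils down to in code. A minimal sketch, assuming the Hugging Face datasets library and its public allenai/c4 mirror; the whitespace split is a stand-in for whatever tokenizer the WaPo actually used:

```python
# Sketch of the WaPo-style analysis: count tokens per website in a
# small sample of Google's C4 data set, then rank the sites.
# Assumes the Hugging Face "datasets" library and the "allenai/c4"
# mirror; splitting on whitespace is an illustrative simplification.
from collections import Counter
from urllib.parse import urlparse

from datasets import load_dataset

# Stream a slice of C4 rather than downloading the full data set.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

token_counts = Counter()
for i, record in enumerate(c4):
    if i >= 10_000:  # look at the first 10k documents only
        break
    domain = urlparse(record["url"]).netloc
    token_counts[domain] += len(record["text"].split())

# Rank sites by how many tokens they contributed to the sample.
for rank, (domain, tokens) in enumerate(token_counts.most_common(20), 1):
    print(f"{rank:>3}. {domain}: {tokens:,} tokens")
```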
And here's the interesting bit. There's a tool that shows whether a specific site has been used and how much of its data was gathered. And enworld.org sits at rank 11,248 out of 10 million.
So watch your language, fellow ENworlders, someone out there is listening and training our future overlords, partially based on what we write here ...
