Reply to thread

Message: <blockquote data-quote="EzekielRaiden" data-source="post: 9323771" data-attributes="member: 6790260">I think we're talking past each other.In the context of the kinds of data being processed, semantic content, the "meanings" of things, is where things like evidence, credence, warrant, etc. lie. Unless you have the ability to look at what something means, and not just what its sequence is, you cannot even begin to consider "evidence" and "warrant" etc. If a human were restricted to exclusively making arguments based on the sequence in which (parts of) words appear, without ever being able to ask what any part actually means, they would never be able to even begin talking about whether one word-part (or set thereof) is warranted or not. To speak of warrant, you must know what the words mean; computers do not know that. They only "know" likely and unlikely sequences in which (parts of) words get used by people.As a general rule, the way language-related* AIs work right now "tokenizes" (breaks up and indexes) the words humans use in some given language (such as English) into subword chunks, "tokens," which can then be combined. Using full words is too ponderous, since many sub-word parts (like "ing" and "tion") show up in bazillions of words, while using individual characters fails to capture enough useful structure in how the language is used by people. Subword tokenization is the norm today, mostly because it's significantly more scalable, and LLMs like GPT derive most of their value from having been scaled up to a large size.The token-space for GPT-3, which is an older model since superseded by GPT-4, is somewhere above 12000 tokens, I don't remember the precise value. That means that GPT-3 has crunched through untold billions of lines of human-written text, identified 12000+ word-bits that get used often enough to be worth noting, and indexed them in a set of enormous matrices (really, sets of matrices of a few different sizes, because it turns out you can save a lot of space by doing certain calculations with smaller matrices). Whenever GPT runs--for example, when you give ChatGPT a prompt and ask it to respond--it takes your prompt input, breaks it up into the tokens it recognizes, and then passes the resulting 12000-dimensional vector through dozens of layers of matrix multiplication, with each layer filled with thousands of training-tweaked weights on the matrix multiplication.These weights capture syntactic relationships between the tokens--they encode the ways that those tokens were used in actual words written by actual humans, the sequences and relationships between sequences. That "training" mostly worked by having the AI guess the next word in already-written text, usually failing, and having its numbers altered until it "correctly" "predicts" the next word actually used. Do this trillions of times, and you build up a mountain of matrix multiplication layers that can encode relatively long, relatively far-reaching syntactic relationships. And some of these can be quite impressive, like how GPT-2 invented a fictitious professor from the University of La Paz for that "unicorns in the Andes who could speak perfect English" prompt.But there's a limit to how much the thing can see. It can only "see" a list of tokens up to the size of its input vector (which, as stated, is lover 12000 tokens). That's a huge number of tokens....and nowhere near enough to cover even a novella, let alone a truly long-form work like a textbook. Once you grow beyond that line, GPT and all models like it will start to "lose the plot," because they do not, cannot, actually understand what words mean. They can only "understand" (encode) sequences of words.If you can't understand what even one single word means, the only portion of truth-value you can ever reach is logical validity, since an argument is valid purely on the basis of its syntactic structure, whether it has the right form, regardless of its semantic content, whether it has the right meaning.*Some, trained for translation, tokenize for multiple languages. Others that are image-related tokenize for patterns of pixels. So on.</blockquote>

Verification