Umbran, if you don't mind me asking, which types of models were used in your thesis work?
Old ones, I admit - my thesis days are a long way back. However...
I'm curious because there is an argument (which I think you reject) that transformers changed the game in this regard by accounting for context in a way that previous architectures didn't, and that this has led to some exciting emergent properties.
So, newer architectures have produced "exciting emergent properties", yes. I don't dispute that these systems can handle massively more complicated data than they could in my research days.
But, the new architectures do not change the fundamental operation of the system - which is to produce a probabilistic approximation or simulation of what is requested, based on the training materials, with no actual understanding of the request. It returns a thing that looks like an answer, instead of an actual answer.
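To make that concrete, here is a toy sketch in Python of what "return the most probable-looking continuation" means. The hand-made probability tables are my own illustration, standing in for a real model's billions of learned weights:

```python
import random

# Toy "next token" probabilities, standing in for billions of learned weights.
# All the model knows is: given the last couple of words, which word tends
# to come next in material like its training data?
next_token_probs = {
    ("the", "file"): {"was": 0.6, "is": 0.3, "contains": 0.1},
    ("file", "was"): {"uploaded": 0.7, "deleted": 0.2, "renamed": 0.1},
    ("was", "uploaded"): {"yesterday": 0.5, "today": 0.3, "recently": 0.2},
}

def next_token(context):
    # Fall back to an end-of-text marker if we have no statistics for this context.
    probs = next_token_probs.get(context, {"[end]": 1.0})
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights)[0]

# "Answer" by repeatedly emitting the most plausible-looking continuation.
tokens = ["the", "file"]
while True:
    tok = next_token(tuple(tokens[-2:]))
    if tok == "[end]":
        break
    tokens.append(tok)

print(" ".join(tokens))
# e.g. "the file was uploaded yesterday" -- perfectly fluent, and "yesterday"
# came straight out of the probability table, not from checking anything real.
```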
The video Morrus gave us in the OP is a clear example.
If I ask a filesystem or operating system what files are in a folder, it will go, fetch the actual filenames currently present, and show me those names and the metadata associated with them.
If I ask an LLM what files are in a folder, the LLM instead answers the question, "What response is, in some sense, closest to the request 'what files are in that folder?'", where "closest to" is a measure currently buried in the black box of billions of weights and connections. The LLM may not have been exposed to the actual contents of the folder for weeks, but it will still return what its training says is among the most probable results.
So, it effectively guesses, and shows you that guess.
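If it helps, here is the difference as a rough Python sketch. The toy_llm() below is not any real API, just my stand-in that samples from canned, plausible-sounding answers:

```python
import os
import random

def toy_llm(prompt: str) -> str:
    # A deliberately silly stand-in for an LLM: it samples an answer that
    # merely *looks* plausible for this kind of prompt, with no connection
    # to the actual filesystem at all.
    plausible_answers = [
        "The folder contains report.docx, notes.txt, and budget.xlsx.",
        "It looks like there are three files: readme.md, data.csv, and backup.zip.",
        "The folder appears to be empty.",
    ]
    return random.choice(plausible_answers)

folder = "."

# Deterministic: the OS reads the directory's actual current contents.
print(os.listdir(folder))

# Probabilistic: text that resembles an answer, untethered from the disk.
print(toy_llm(f"What files are in {folder}?"))
```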
And that's where hallucination comes in. When asked what time the transcript was uploaded, it didn't check when it was uploaded. It found what, in its black box, was the *most likely text response* to "when was it uploaded?" And, having no actual understanding of time, or the question, or of what a "file" or "uploading" even are, it cannot ask itself whether the answer makes sense, because it has no concept of sense or nonsense in and of itself.
It argues about its correctness, not because it is argumentative, but because that's what its training says is the most likely text response to text that challenges its correctness!
Or would you be willing to say more about why you don't think new architectures have an impact here?
I think the best arguments are empirical. I mean, look at that video in the OP! Does that look like it is ready for prime time to you?
I have also come across several other measures of note:
The PMI (Project Management Institute, the most widely accepted authority on project management techniques) notes that about 80% of enterprise genAI projects fail*. The two basic reasons for failure are 1) the project does not deliver the expected value, and 2) in effect, the customer was sold a solution looking for a problem, rather than starting with a real problem that the customer knew needed a solution.
Several studies, in both the prose/technical writing and code writing domains, that looked beyond focused task completion and found that overall productivity fell when genAI was included as a major tool. In essence, any improvement seen in completing one task is overwhelmed by the effort needed to correct the errors genAI introduced downstream of that task.
*Failure, for the PMI, means running far over schedule, going far over budget, or failing to deliver a proper return on investment.