ChatGPT lies then gaslights reporter with fake transcript

The podcast idea is one I've used just a little, but it seems pretty cool. The ability to take a scientific paper and have it explained to me during a commute would be valuable.
Just be aware that it is bound to hallucinate in this case unless the paper in question is very, very old and widely recirculated and improved upon. Hallucinations happen most often with niche and novel content.
 


The problem with the video is that it's... like a lot of news reporting, pretty lacking in data. We get a single anecdote (possibly fabricated to convey the point) showing that ChatGPT outputted a wildly hallucinated result about previously entered data, which I readily accept since something similar happened to me when using it. Except that I wasn't surprised, so I dismissed the hallucination and re-prompted my request until it was correctly executed -- so far, I thought that was what a regular person with no particular skill would do, but apparently it's because I have a Charles Xavier level of fluency with AI. Why not, after all. So, let's assume we have a report on a true, single incident.

It is reported, demonstrating what? That it can happen. Which is correct. It can happen. But what conclusion can we draw about whether the software is good or bad? The journalist claims to have been using it every day for a long time before this happened, and probably to his entire satisfaction. So, it is obviously not bad all the time.

Now, let's imagine another news report. Instead of ChatGPT, the newsman explains to his co-host his dealings with a new intern on the staff, Chad Jaypity. He usually does his summaries quite well and everyone likes him, but yesterday he was faking his work and denied it, then denied he had even been asked to do it and gaslit the newspaper. And the newsman goes on to tell how the intern doubled down when caught not having done the job.

What could this piece teach us about the ability of humans to be good or bad at a job? Nothing. We can learn that there are occurrences of faulty work by AI or interns, but we don't have enough data to determine the general answer. Is it worthwhile to be warned that humans and AI can output false results? Sure! And books too. And lots of things. But we can't assess their performance from a single result, and that's not what the video is about. The video explicitly explains that the newsman was satisfied with his use of the tool for a long time before an incident happened, so what is the conclusion? Obviously, it's not "stop using ChatGPT for his work"; it's "learn to identify the hallucinations the same way you deal daily with an incompetent, slothful subordinate." We don't stop employing people by saying "they are bad at their job"; we make the most of the people we work with despite their flaws.

Same with the tool. Is it flawless? Certainly not. Can you gain productivity with it? Certainly. Both examples are in the video. Is the productivity gain worth the productivity loss incurred by checking the results for anything important and dealing with the hallucinations that may happen? That is the key question, and it depends on the line of work, the exact tool used, and the training provided to the operator of the solution. Those are key questions, totally unaddressed in the video, that would need answering to give an honest answer about whether the tool is useful or not.




LLMs don't search, but professional AI solutions aren't just LLMs. I am part of the team working to assess a legal AI tool by Dalloz, and it is an LLM interface coupled with their database; it either searches that database or is trained on very specific content, and it is supposed to adversarially check its answers against the database. I don't know yet how much time it will save over regular use of the database -- possibly none, possibly some but not enough to be worth the price -- but there is also the possibility that an AI solution in a professional environment isn't just a 20 USD/month ChatGPT toy used on its own. Or maybe it's not worth using a very expensive tool built upon an LLM, and it's better to run DeepSeek for free on your own computer and take the time to deal with the inaccuracies yourself.
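To make the "LLM coupled with a database" idea concrete, here is a minimal sketch of that retrieve-then-answer-then-verify pattern. Everything in it is hypothetical -- `search_database` and `ask_llm` are stand-ins, not the vendor's actual API -- it just shows the general loop such tools are built around.

```python
# Hypothetical sketch of a retrieval-grounded legal assistant.
# The function names are stand-ins, not any real product's API.

def search_database(query: str) -> list[dict]:
    """Stand-in for a search over a curated legal database."""
    # A real tool would hit the vendor's index; here we fake one result.
    return [{"ref": "Article 1749-51", "text": "...relevant statutory text..."}]

def ask_llm(prompt: str) -> str:
    """Stand-in for a call to the underlying language model."""
    return "Answer drafted from the supplied sources. [source: Article 1749-51]"

def answer_with_sources(question: str) -> str:
    # 1. Retrieve: pull candidate passages from the curated database.
    passages = search_database(question)
    context = "\n".join(f"[{p['ref']}] {p['text']}" for p in passages)

    # 2. Generate: the LLM is asked to answer *only* from the retrieved context.
    draft = ask_llm(f"Answer using only these sources:\n{context}\n\nQ: {question}")

    # 3. Verify: keep the answer only if a cited reference was actually retrieved.
    cited_refs = {p["ref"] for p in passages}
    if not any(ref in draft for ref in cited_refs):
        return "No supported answer found -- possible hallucination, check manually."
    return draft

print(answer_with_sources("What is the limitation period for this claim?"))
```

Step 3 is the "adversarial check" mentioned above: an answer that can't be tied back to a retrieved passage gets flagged rather than presented confidently.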
Great points. I wonder if the reporter still uses ChatGPT. That would be interesting, wouldn't it? Based on the cautionary tale he provided, why would he if the tool is prone to such lying?

I have a theory...but I don't want to be called out for attacking the reporter. I would bet he's a decent person in real life, as good as most, me included, and he's just doing his job. But contrary to what some may believe, being a news reporter (which I've been) isn't only about covering the news and doing the public good. It's also a business, and it's about doing what your boss tells you to do the way they tell you to.
 

I have a theory...but I don't want to be called out for attacking the reporter. I would bet he's a decent person in real life, as good as most, me included, and he's just doing his job. But contrary to what some may believe, being a news reporter (which I've been) isn't only about covering the news and doing the public good. It's also a business, and it's about doing what your boss tells you to do the way they tell you to.

At some point, we moved from state-controlled news outlets, because they reported only what the government wanted them to report, to a privately, capital-controlled news outlet model, and I think we were too eager or naive in not assuming that the latter would only report on what the capital owners want them to report. Which is not the same thing -- it's what sells advertising time rather than what is acceptable to the authorities -- but both are a subset of what should ideally be reported. I wonder if there is any research comparing the actual efficiency of reporting between the two models, depending on the benevolence of the government and its interest in the public good (i.e., not every state is North Korea).
 

Great points. I wonder if the reporter still uses ChatGPT. That would be interesting, wouldn't it? Based on the cautionary tale he provided, why would he if the tool is prone to such lying?

If the video is to be trusted, he will continue to use it. The ending pitch is "AI will change our lives and workplaces, but we've got to be careful." So, if it is taken at face value, it will change our lives, so it must be widely used in the future, and we just need to be careful about its use. I could have said the same thing about cars in 1925.
 

Can you run models at home? Of course yes; my toaster can potentially run Stable Diffusion (and from what I can see, some small chat models too). But these LLMs are just plain gigantic for a single home computer. You cannot even hope to run DALL-E, Midjourney or ChatGPT locally. "Large" in this case is closer to (in D&D terms) Colossal. Even running the models takes entire datacenters and is very expensive.
You made a jump there that just isn't true. Yes, it's the tiny models that you can run at home. But just having the ability to do so at home has prompted a lot of improvements. For generative AI art it was home users who worked out LoRAs and their descendants, which add a small trainable layer on top of a trained model without the cost of retraining the original model (a rough sketch of the idea is below). Lots of improvements came out of those limitations. Necessity is the mother of invention and all that.
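For anyone curious what "a training layer on top of a frozen model" means in practice, here is a minimal LoRA-style sketch in plain PyTorch. It is only an illustration of the low-rank-adapter idea, not the code any particular image tool actually ships; the layer sizes and rank are made up.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pretrained linear layer plus a small low-rank adapter.

    Only A and B are trained, so fine-tuning touches far fewer weights
    than retraining the original layer. Sizes and rank here are made up.
    """
    def __init__(self, pretrained: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = pretrained
        self.base.weight.requires_grad_(False)   # freeze the original weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        in_f, out_f = pretrained.in_features, pretrained.out_features
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)  # trainable
        self.B = nn.Parameter(torch.zeros(out_f, rank))        # trainable, starts at zero

    def forward(self, x):
        # Original output plus the low-rank correction (x @ A^T @ B^T).
        return self.base(x) + x @ self.A.T @ self.B.T

# Usage: wrap an existing layer and train only the adapter parameters.
layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable adapter parameters: {trainable}")  # tiny vs. 768*768 in the base layer
```

The point is that the frozen base weights never change, so a home GPU only has to fit and update the two small adapter matrices.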

They've published the costs of bulk queries, and it's not "very expensive" to do a query. Doing tens of millions of queries, especially in deep thinking mode, is a different story, but that requires tens of millions of questions asked by people.

Again, I've professionally managed both owned and co-lo datacenters for global companies. Running a single query against a trained model isn't "very expensive".

Say anything you want, but OpenAI actively loses money when servicing paying consumers. So do all of the smaller companies.
If you want to slide into a separate discussion about the economics of the companies, fine.

Years ago, if you wanted to find a phone number you dialed 411. Back when phone calls were still a big thing. And you got charged for it. Google did a free one (1-800-GOOG-411, if I recall) that would also connect you for free. You just spoke what you wanted, it verified it, and then it would give you the number and offer to connect you. It was a free service.

Did it make money? Lots. Oh, you can only see that calling a free number didn't make them anything directly. What they were doing was training voice recognition on loads of people, with different accents and dialects, and getting proper nouns down correctly. It was an extremely economical way to do that, and it paid off nicely.

So if you're only looking at the fact that OpenAI doesn't make money on current queries, you need to ask yourself why they are doing it. It's not altruism. There's a payoff, even if it's not a direct one.

Remember the old truism, "If something is free, you're the product".
 

I seem to be deeply offensive to many people on this subject, which isn't my intention. I think I've reached a point of acceptance and comfort using AI that many people here haven't yet, and so my blasé references are setting people off. I think most will also get to that point though, because they won't have a choice. Pandora's Box has already been opened.

I understand what marketing is, but I'm past being marketed to. I am a consumer of the product. It sold itself after I started using it.

The company will still market it though, spin the benefits and try to manipulate people into buying it, spending more on it, as will every other AI company. That's less about AI and more about human greed and capitalism.
They might not be marketing it to you specifically, given that you seem to be an early adopter. More generally, there's a heavy push by the companies that sell these services - and internally, by companies trying to justify spending so much on them - to get the average office drone to use these tools and learn to love and rely on them, because the average office drone gets easily frustrated by them and just doesn't see the point.
 

You realize that the relationship between the space race and the public is nowhere near the same as the relationship between AI development and the public?

In order for these to be comparable, you'd need a National Artificial Intelligence Administration - NAIA, comparable to NASA, with similar funding, operation, and value return models. Which we very much don't have.
I would agree. As it stands I think that power production would be in danger of going in exactly the opposite direction, as that tech is already available to serve the "immediate (created) need."
 

A few comments I want to respond to in parallel:
If I ask an LLM what files are in a folder, the LLM instead answers the question, "what response is, in some sense, closest to the request 'what files are in that folder?'" Where "closest to" is a measure currently resting in the black box of billions of weights and connections. The LLM may not have been exposed to the actual contents of the folder for weeks, but will still return what its training marks as among the most probable results.

So, it effectively guesses, and shows you that guess.
LLMs don't search, but professional AI solutions aren't just LLMs.
Just be aware that it is bound to hallucinate in this case unless the paper in question is very, very old and widely recirculated and improved upon. Hallucinations happen most often with niche and novel content.
I don't think the way other people here are describing LLMs being used matches my use of them, and that explains some of the disconnect. LLMs can search, and search is crucial to using their capabilities for research. When I say I use them for research, I am not asking them to return the text of a paper or to rely on their training data to give me an answer. I'm asking them to perform a live search and to return results that correspond to my query, highlighting the relevant parts. Then I will go and look at the answer myself.

You should never ask an LLM to give you extended sections of direct text, like "give me the entire transcript of this podcast".
 

You should never ask an LLM to give you extended sections of direct text, like "give me the entire transcript of this podcast".

Sure, and an LLM alone would certainly fail at some point, but packaged AI solutions are promising exactly that. I'd expect them not to rely only on LLM training, especially when working on professional, potentially sensitive data. Or even on public, readily available data. If you check a legal reference with ChatGPT, it might give you the correct answer but state an invalid reference; then, when asked for backup, it performs the search and comes up with "sorry, this isn't article 2248-30, it is article 1749-51 of this other code," or "I couldn't find a reference for the statement I made," which is usually enough to identify a hallucination.
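That "ask for backup" step can be reduced to a very small check. The sketch below is hypothetical: `lookup_article` and the toy database are stand-ins for whatever legal source you actually have access to, not any vendor's API.

```python
# Hypothetical sketch of the "ask for backup" check described above.

KNOWN_ARTICLES = {"1749-51": "...text of article 1749-51..."}  # made-up toy database

def lookup_article(ref: str) -> str | None:
    """Stand-in for a real legal-database lookup; returns None if the reference doesn't exist."""
    return KNOWN_ARTICLES.get(ref)

def check_citation(cited_ref: str) -> str:
    text = lookup_article(cited_ref)
    if text is None:
        # The model cited something the database has never heard of:
        # a strong hallucination signal, so the claim goes back for manual review.
        return f"Reference {cited_ref} not found -- treat the claim as unverified."
    return f"Reference {cited_ref} exists; now read it to confirm it supports the claim."

print(check_citation("2248-30"))   # not in the database -> flagged
print(check_citation("1749-51"))   # found -> still needs a human read
```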
 

And that's where hallucination comes in. When asked what time the transcript was uploaded, it didn't check when it was uploaded. It found what, in its black box, was the most likely text response to "when was it uploaded". And, having no actual understanding of time, or the question, or what a "file" or "uploading" are, it cannot ask of itself whether the answer makes sense, because it has no concept of sense or nonsense in and of itself.

It argues over its correctness, not because it is argumentative, but because that's what its training says is the most likely text response to text that challenges its correctness!
Likewise, I want to respond to this specifically. I agree with your entire description of the limitations based on mechanism, but I don't agree that this causes the technology to lack value. It just means it should be used properly. If you interface with a search function and ask it for references, then matching "what my question looks like" to "what the text of the sources the search function returned looks like" is actually incredibly useful (a toy sketch of that matching step is below).
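As a toy illustration of that matching step, here is a minimal sketch of scoring retrieved passages against a question. The bag-of-words vectors and the passages are made up; real systems use learned embeddings, but the idea of ranking sources by similarity to the query is the same.

```python
import math
from collections import Counter

def vectorize(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; real systems use learned dense vectors."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

question = "when was the podcast transcript uploaded"
passages = [  # made-up candidate sources a search function might return
    "the transcript file was uploaded to the shared folder on tuesday",
    "the host discussed tabletop rules for an hour",
]

# Rank the sources by how closely they match the question, then read the top one yourself.
ranked = sorted(passages, key=lambda p: cosine(vectorize(question), vectorize(p)), reverse=True)
print(ranked[0])
```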

I think the best arguments are empirical. I mean, look at that video in the OP! Does that look like it is ready for prime time to you?
A post later, you critique me for providing an anecdote. It's a fair critique, but, please.

Several studies, in both the prose/technical writing and code writing domains, which looked beyond focused task completion, found that overall productivity was reduced when genAI was included as a major tool. In essence, any improvement seen in completing one task is overwhelmed by the effort needed to correct the errors genAI introduced downstream from that task completion.
I'm familiar with the coding study. I remember it being criticized because it was focused on users who were experts at the code bases they were working on, while the ideal LLM use case is someone who is generally skilled in a language working with a new codebase.

That said, I'm not going to rule out the thesis that all perceived benefits of LLMs are just due to people having favorable, subjective experiences which don't match reality. But I think that is far from proven.
 

