Judge decides case based on AI-hallucinated case law

LLMs seem good at summarizing and detecting relevant information. They are also pretty good at presenting information in a coherent report. They will absolutely make mistakes, just like in this thread's title. But weigh that against the number of times a human lawyer makes a mistake. We haven't seen decent studies yet, but I am not convinced that the rates are very different.

Given the number of times people have to go back to the doctor because they weren't treated properly the first time, which is a common experience, I'd guess the error rate is not insignificant -- though the rate of fatal errors is certainly low.

This document from the French HAS (health ministry agency) summarizes existing studies as of late 2024. It includes studies from the US. Key points IMHO:
  • 59% of medical errors in the US were caused by late or missed diagnoses (which might be because of the cost of medicine),
  • The rate of incorrect diagnosis depends heavily on where you are. In hospital, it's 0.7%, but it goes up to 5%-10% in other settings.
  • A study covering several countries showed rates ranging from 7% to 36% of adverse events linked to care (from diagnostic errors to incorrect treatment), with the highest rates in primary care.
  • Most diagnosis errors pertain to commonly encountered pathologies (which may or may not be representative of how common they are; it might mean that diagnostic errors can't just be explained by doctors being stymied by a rare disease, House MD-style).
 


Nobody is proposing, as far as I know, to trust a general-purpose LLM with a diagnosis. However, suppose I wanted to know about a few illnesses that cause flu-like symptoms, not because I want a diagnosis but because I want to read about some exotic diseases out of boredom, or to outline an adventure in which someone's cough turns out to be deadly. In that case, I'd prefer the AI to give me the most accurate information it can, rather than hallucinating that flu-like symptoms are an early warning sign of transforming into a Deep One, or just telling me to go see a doctor for a chat.
The problem is, people ARE trusting general purpose LLMs to make diagnoses or suggest dangerous self-help actions. We’ve seen that with the glue on pizza and taste-testing of mushrooms incidents. Those got caught, which is why we know about them.

But as with the aforementioned site that tracks AI hallucinations in legal settings, there are probably more cases like the glue & ‘shrooms incidents that aren’t being reported, for a variety of reasons. And right now, we don’t have any idea what the ratio of good to bad advice is beyond self-reports from lower-stakes occurrences.

As to your second point- I hadn’t read those, and the numbers they cite are better than I’d seen recently. I actually find them promising!👍🏽
 

Open Evidence -- many, many doctors feel this significantly helps them with their diagnoses. I don't have access so I cannot test it (I'm a doctor of statistics, working in the medical field), but the weight of opinion seems to be that it really does help a doctor make a diagnosis.
Those are doctors evaluating the responses the AIs are giving, not Joe Public. There’s a chasm of difference in expertise & experience between the two groups.
LLMs seem good at summarizing and detecting relevant information. They are also pretty good at presenting information in a coherent report. They will absolutely make mistakes, just like in this thread's title. But weigh that against the number of times a human lawyer makes a mistake. We haven't seen decent studies yet, but I am not convinced that the rates are very different.
I was admitted to the Texas Bar in 1994, and every month, I get the Bar Journal, which devotes a section reporting on who got sanctioned and for what reason. And that section is NEVER empty.

Lawyers who actually fabricate evidence or cases from whole cloth are vanishingly rare. It’s too easily caught. It’s a much better gamble to shade the truth about what a case stands for than to simply make stuff up.

In comparison, the aforementioned site tracking AI hallucinations in legal cases has reported more cases THIS YEAR than I’ve seen lawyers do unassisted in my career.
 

LLMs seem good at summarizing and detecting relevant information. They are also pretty good at presenting information in a coherent report. They will absolutely make mistakes, just like in this thread's title. But weigh that against the number of times a human lawyer makes a mistake. We haven't seen decent studies yet, but I am not convinced that the rates are very different.
I'd guess the rates are very different for unsupervised use, at the moment. And the LLM errors are more egregious than what lawyers would typically do--I doubt they are making fake cases that often.

My prior for how to sanction LLM use is to apply the same standards you would without LLM use. It is the responsibility of the user to use it responsibly.

What would be the penalty for a lawyer who made up a case whole cloth?

For supervised LLM use, once people gain the experience to integrate it appropriately into their work, I would expect a decrease in errors relative to the no LLM case.
 

Well, laws protecting people against themselves are rarely popular with every single person (if they were, maybe a law wouldn't be necessary). That's why some think a representative government is better than a direct democracy: it's easier to inform a small circle of lawmakers than the whole population. I don't think people reject the seatbelt based on all available information; they often downplay the risks of driving without it (or they think they are good drivers, which isn't the point: the seatbelt protects you against all the other bad drivers), and so on.

The goal of the government is to find the best balance between individual liberty, healthcare concern, cost... and popularity (in democracies).
An interesting aspect of seat belt laws is also that they force every car manufacturer to include them. If they weren't legally required, maybe they would only be available as a premium package. Of course, seat belts make every car more expensive, but when they're just an option, that can change availability considerably.
Basically you get more deaths from traffic accidents, and the victims happen to be mostly poor. The most extreme case: safe, fearless rich people's kids recklessly driving, crashing into poor people, and killing their kids.
 

But regulations are not statements of fact. They weigh many competing concerns, like whether they violate any rights and how to balance economic vs public health goals.
A perceived need is typically the basis for initiating the law drafting process. Usually- but not always- that need is factually based. Everything else follows after.
Nor is my point "if anyone is insulted you can't make a law". But that how people respond to a regulation is one of many things you must weigh.
Optics are always an issue, certainly, but they’re lower priority than whether a law will achieve the desired goal.
That is the exact opposite of what has taken place in this thread. The argument made was explicitly "we have to protect uneducated citizens from the danger".
“We aren’t legislators” is the glib response.

Of course legislators are human, and will go through similar discussions as we’ve done here. But it definitely won’t be on the record.

If you think I am talking about the minutes of legislation or the text of surgeon general warnings you have not understood me. I'm talking about the topics raised when discussing legislation. We are having such a discussion right now. And the arguments being offered are about the ability of people to evaluate medical claims accurately.
The fact that untrained individuals are at a severe disadvantage in understanding medical advice is extremely well documented, over centuries, worldwide.

Public health measures have saved more humans than pharmaceuticals & surgeries combined, and yet they’re the area in which there’s the most pushback- even among experts. Think variolation against smallpox in Boston in the 1700s. Or the mid-1800s London cholera outbreak linked to contaminated water pumps. Or mask wearing against airborne diseases. Or hand washing for doctors (and then everyone else). Sterilization of surgical equipment. Indoor plumbing.

Nonspecialized AIs dispensing medical advice is every bit as much a public health issue as any of those.
 

The problem is, people ARE trusting general purpose LLMs to make diagnoses or suggest dangerous self-help actions. We’ve seen that with the glue on pizza and taste-testing of mushrooms incidents.

As long as enough don't, we're golden. The glue on pizza and the taste-testing of mushrooms are the exploding pressure-cookers of decades past: a risk that existed, but that won't come up the same way in the future. They have all been corrected, and yet they will be talked about for years. Right now, LLMs lack training on enough legal data to provide specialized legal advice (only broad general descriptions), and that's where they struggle -- for now.

Those got caught, which is why we know about them. But as with the aforementioned site that tracks AI hallucinations in legal settings, there are probably more cases like the glue & ‘shrooms incidents that aren’t being reported, for a variety of reasons. And right now, we don’t have any idea what the ratio of good to bad advice is beyond self-reports from lower-stakes occurrences.

This is quite easy to test, given that there is ample computing power available to measure the error rate by submitting synthetic questions. Not that anyone would be interested in running such a trial, unfortunately. There are corrective measures: simply ask a second LLM to analyze and check what the first LLM outputs. It will catch most hallucinations (possibly introducing others to be corrected by the first).
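For what it's worth, here is a minimal Python sketch of the two ideas in that paragraph: estimating an error rate against synthetic questions with known answers, and having a second LLM pass review the first's output. Everything here is hypothetical; call_llm is a stand-in for whatever model API you would actually use, and the review prompt and pass/fail check are illustrations, not a tested harness.

```python
# Minimal sketch of the "ask a second LLM to check the first" idea.
# call_llm() is a hypothetical placeholder; swap in whatever model API you use.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model API here")

def answer_with_check(question: str) -> tuple[str, str]:
    """Draft an answer, then ask a second pass to flag unsupported claims."""
    draft = call_llm(f"Answer concisely:\n{question}")
    review = call_llm(
        "Review the answer below for factual errors or unsupported claims. "
        "Reply with exactly 'OK' if there are none, otherwise list the problems.\n\n"
        f"Question: {question}\nAnswer: {draft}"
    )
    return draft, review

def estimate_error_rate(synthetic_qa: list[tuple[str, str]]) -> float:
    """Crude error-rate estimate over synthetic questions with known expected answers."""
    errors = 0
    for question, expected in synthetic_qa:
        draft, review = answer_with_check(question)
        # Count it as an error if the expected fact is missing or the reviewer objected.
        if expected.lower() not in draft.lower() or review.strip() != "OK":
            errors += 1
    return errors / len(synthetic_qa) if synthetic_qa else 0.0
```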

With regard to non-specialized AI dispensing medical advice, do we have a rate of accidents linked to improperly understanding medical advice found via Google? Accidents caused by misunderstanding what an LLM dispenses are bad, but if the rate is the same as or lower than the alternative (people googling for health advice from a random board, which apparently two thirds of Internet users do), then maybe it's a public health improvement over the current situation.

As to your second point- I hadn’t read those, and the numbers they cite are better than I’d seen recently. I actually find them promising!👍🏽

What is true of an LLM of 2023 isn't true of an LLM of late 2024, and what was true in late 2024 might not be true in mid-2025. Forbidding LLMs to talk about a topic will disincentivize improvement in this field (a lawmaker will be loath to take the political risk of allowing something that was previously disallowed...), so there will be less research to improve the products. Acknowledging that they are imperfect chatters and nothing more at this point, until they can be proven fit for more serious use, is certainly the best way to go.
 

“We aren’t legislators” is the glib response.

Of course legislators are human, and will go through similar discussions as we’ve done here. But it definitely won’t be on the record.
No. But we vote for legislators. Your argument seems to me "it is not a concern if the underlying arguments for regulations are perceived as condescending because these arguments will not be entered into the public record". I don't find that compelling. I think people are aware enough to know what the actual motivation is and prickly enough to be offended by it.

The fact that untrained individuals are at a severe disadvantage in understanding medical advice is extremely well documented, over centuries, worldwide.

Public health measures have saved more humans than pharmaceuticals & surgeries combined, and yet they’re the area in which there’s the most pushback- even among experts. Think variolation against smallpox in Boston in the 1700s. Or the mid-1800s London cholera outbreak linked to contaminated water pumps. Or mask wearing against airborne diseases. Or hand washing for doctors (and then everyone else). Sterilization of surgical equipment. Indoor plumbing.
I agree with all of this.

Nonspecialized AIs dispensing medical advice is every bit as much a public health issue as any of those.
I think it is much less threatening. We will have to wait a bit longer I suppose. But I do not see evidence that generative AI is a comparable public health threat to smallpox.

There will be a nonzero number of people who use it uncritically and do harmful things as a result. I weigh that--and my own assumptions about how common and damaging this will actually be--against the risks of perceived government censorship and the continued collapse in trust in experts. Right now I see a further decline in trust as the greater issue. It is directly responsible for the measles outbreak, for example (well over 1,000 cases so far). And that is a comparatively easy problem to solve from a technical perspective (the vaccine works), compared to other threats.

I think if you are interested in public health the #1 priority has to be regaining the trust of your audience.
 

However, suppose I wanted to know about a few illnesses that cause flu-like symptoms, not because I want a diagnosis but because I want to read about some exotic diseases out of boredom, or to outline an adventure in which someone's cough turns out to be deadly.
I want to revisit and highlight this.

What you’re describing ISN’T medical advice, it’s factual/historical info. That stuff is freely available on numerous websites, including Wikipedia.

That info is not what we’re discussing prohibiting general-use AIs from dispensing. We’re talking about prohibiting them from responding with advice. BIG difference.
 

As long as enough don't, we're golden. The glue on pizza and the taste-testing of mushrooms are the exploding pressure-cookers of decades past: a risk that existed, but that won't come up the same way in the future. They have all been corrected, and yet they will be talked about for years. Right now, LLMs lack training on enough legal data to provide specialized legal advice (only broad general descriptions), and that's where they struggle -- for now.
While those particular instances may have been addressed, they’re just the tip of the iceberg. There are certainly less egregious examples out there, unreported, as well as future similarly dangerous incidents to come.

And more data isn’t necessarily the solution. There’s such a mass of legal cases and complexity in law that many cases never see publication, or even the inside of a courtroom.

I mentioned the ticking time bomb of decades of as-yet unlitigated clauses in oil & gas cases. There’s nothing to train an AI on because there’s ZERO case law- just the unpublished opinions of O&G teachers & analysts. There’s similar clauses lurking in other industries as well.

Despite being presumptively unconstitutional, several states nonetheless have laws preventing atheists from holding public office. But those laws are so far untested. What would an AI tell an atheist considering running for mayor in such a state?

Hell- the last case I had in probate court involved a situation so rare that the judge had never seen it…but his clerk had. Not in a published case, but in her 30+ years of employment. There was no published case law. She had to tell him how a previous judge had handled it.
This is quite easy to test, given that there is ample computing power available to measure the error rate by submitting synthetic questions. Not that anyone would be interested in running such a trial, unfortunately. There are corrective measures: simply ask a second LLM to analyze and check what the first LLM outputs. It will catch most hallucinations (possibly introducing others to be corrected by the first).
I will note that past AIs have had difficulty detecting and reporting their own errors, like miscounting the number of “r”s in “strawberries”. While that has since been corrected, there’s a certain level of insanity in trusting technology known to be error-prone or to hallucinate to run diagnostics to detect errors or hallucinations.


With regard to non-specialized AI dispensing medical advice, do we have a rate of accidents linked to improperly understanding medical advice found via Google? Accidents caused by misunderstanding what an LLM dispenses are bad, but if the rate is the same as or lower than the alternative (people googling for health advice from a random board, which apparently two thirds of Internet users do), then maybe it's a public health improvement over the current situation.
I don’t know that there’s been systematic research on the harm that “Doctor Google” does, just anecdotes.

But one anecdote I know of from discussing CME with my father was that many doctors blamed part of the overprescribing of antibiotics on patients demanding them based on “their research” and threatening to walk out if they didn’t get them. So (some) doctors would prescribe a short course of antibiotics along with whatever their affliction ACTUALLY demanded.

(Overprescribing antibiotics reduces the effective product life of that antibiotic in particular as resistance increases, as well as contributing to the rise of other antibiotic resistant bacteria over time. The more we use them, the faster we lose them.)
What is true of an LLM of 2023 isn't true of an LLM of late 2024, and what was true in late 2024 might not be true in mid-2025. Forbidding LLMs to talk about a topic will disincentivize improvement in this field (a lawmaker will be loath to take the political risk of allowing something that was previously disallowed...), so there will be less research to improve the products. Acknowledging that they are imperfect chatters and nothing more at this point, until they can be proven fit for more serious use, is certainly the best way to go.
Not “talk about”, “advise”. Completely different standards.
 
