Judge decides case based on AI-hallucinated case law

although the fact that they can warn about certain topics certainly seems to indicate that those topics could be flagged and not answered at all, rather than left open to misuse.

Certain topics are flagged (Tiananmen or criticism of Xi Jinping in China, for example). But refusing to answer questions about the law will also prevent laymen who are curious about a court order from providing it to the tool and getting an explanation "like I am 5 years old". We'll see in the future (and again, I don't think there will be a global consensus) where the balance lies between functionality and the risk of use by grossly incompetent professionals.

But we don't forbid websites from providing dubious information on health and law.
 


Okay. So no relevant additions to the discussion. We are at an impasse. My suggestion is that there is no great way to test AI outside of a live environment.

Well, that heart-lung machine I mentioned earlier - do you figure the only tests ever done on it were directly in surgery? And that when they began tests in surgery, they just sold them to whoever and waited for someone to complain? Of course not! That thing was tested in its various individual parts, then the whole thing was tested on liquids of the same viscosity as blood. Then they put animal blood in it, then human blood - all while attached to machines that monitor the temperature, pressure, oxygenation and so forth. The blood is examined after being run through the machine, to make sure the cells are undamaged.

Putting a human whose life depends on it on the machine is only the final stage, and that is done in carefully controlled trials, with backups if something goes wrong. And the results of all this are reviewed by outside experts, who have no financial stake in the results, before the machine is certified for general use in hospitals.

Your suggestion is that there must be a good way but that you cannot comment on what it might be due to lack of expertise. There’s nothing more to say here.

You had very specific questions, like "What counts as pass/fail", which I cannot answer in a general sense. I do software project management. I can only speak to software QA broadly, and in a highly simplified manner.

For example, you don't make an entire application, and then throw it out to the market untested and wait for users to complain about errors.

Testing software is a multi-faceted, multi-layered thing. There is testing done by developers to make sure smaller sections of code work as expected (often called "unit testing"). There is testing on a larger scale, which checks whether separate parts of an application interact as expected (often called "integration testing", "system integration testing" or "SIT testing"). Then there's testing in which we check that the results the end-user gets are what is expected/desired (usually referred to as "functional testing").
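
To make "unit testing" concrete, here is a minimal sketch of what one looks like in, say, Python with pytest - the invoice_total function and its tax rule are invented purely for illustration:

```python
# Hypothetical function under test: the business rule here is made up for the example.
def invoice_total(line_items, tax_rate):
    """Sum line-item prices times quantities and apply a flat tax rate."""
    subtotal = sum(item["price"] * item["qty"] for item in line_items)
    return round(subtotal * (1 + tax_rate), 2)

# Unit tests: each checks one small, specific behavior in isolation.
def test_invoice_total_applies_tax():
    items = [{"price": 10.00, "qty": 2}, {"price": 5.00, "qty": 1}]
    assert invoice_total(items, tax_rate=0.08) == 27.00  # 25.00 subtotal * 1.08

def test_invoice_total_empty_order_is_zero():
    assert invoice_total([], tax_rate=0.08) == 0.00
```

Integration and functional tests have the same pass/fail shape, just at a larger scope: instead of one function, they exercise whole services or whole user-facing workflows.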

Also, software has multiple types of environments it can exist in. There are development environments in which developers work, which are highly dynamic and change rapidly as engineers make changes to get things to work. There are QA environments in which most SIT and functional testing happens. There are "staging" environments, where software goes (and can again be tested) in a setup typically as much like the environment the public sees as possible, and then finally there's "production", which is where you and I see it, available to the public.
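
As a rough illustration of how the same code gets pointed at those different environments (the environment names, URLs and variable below are invented for the example, not any particular product's setup):

```python
import os

# One block of settings per environment; the application code is identical everywhere,
# only the configuration it loads changes.
CONFIGS = {
    "dev":        {"db_url": "postgres://localhost/app_dev",  "debug": True},
    "qa":         {"db_url": "postgres://qa-db/app_qa",       "debug": True},
    "staging":    {"db_url": "postgres://staging-db/app_stg", "debug": False},
    "production": {"db_url": "postgres://prod-db/app",        "debug": False},
}

def load_config():
    env = os.environ.get("APP_ENV", "dev")  # default to the developer sandbox
    return CONFIGS[env]
```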

You don't generally test in production. End-users are already getting at it there, and any problems found there are errors that end-users see and are impacted by, thinking horrible things about your company as things go wrong. You always want to find errors before software gets to production.

Testing is not "now start using it randomly and report when something goes wrong". QA professionals are exacting and methodical. They write hundreds, thousands, or tens of thousands of test cases, checking hundreds, thousands, or tens of thousands of individual behaviors of the system. For a big system, if you are serious, those human-written test cases are fed into a system that automates executing the tests, checks whether the result matches what the QA engineer said it should be, and marks the test as failed if it doesn't. That defines a bug, which gets handed back to the developers to fix. Lather-rinse-repeat until the tests all pass.
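
Stripped of all the real-world machinery, the automation those QA tools provide boils down to a loop like this - a toy sketch, not any particular tool; real harnesses such as pytest or JUnit add reporting, retries, parallelism and so on:

```python
# Each test case pairs an input with the result the QA engineer says is correct.
def run_suite(test_cases, system_under_test):
    failures = []
    for case in test_cases:
        actual = system_under_test(case["input"])
        if actual != case["expected"]:
            # A mismatch marks the test as failed; it becomes a bug handed back to developers.
            failures.append({"name": case["name"], "expected": case["expected"], "got": actual})
    return failures
```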

Where do those test cases come from? People who make software have product managers who define what the software is supposed to do. For, say, an application that's supposed to support a doctor in diagnosing ailments, they'd define what ailments are in the list that the system is supposed to be able to catch, and on what basis it is to suggest a diagnosis. QA will test whether the system gives the right results or the wrong ones.
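
In practice, those requirement-driven cases often end up as a table fed to the test harness. A hypothetical sketch (the diagnose() function, the symptoms and the expected outputs are all invented; in a real project the expected column comes straight from the product manager's specification):

```python
import pytest

def diagnose(symptoms):
    # Hypothetical stand-in for the real diagnostic engine under test.
    if "chest pain" in symptoms:
        return "possible cardiac event - escalate immediately"
    if "fever" in symptoms and "cough" in symptoms:
        return "suspected respiratory infection - refer for testing"
    return "no diagnosis - insufficient information"

# One row per requirement the product manager wrote down.
@pytest.mark.parametrize("symptoms, expected", [
    (["fever", "cough"], "suspected respiratory infection - refer for testing"),
    (["chest pain", "shortness of breath"], "possible cardiac event - escalate immediately"),
    (["mild headache"], "no diagnosis - insufficient information"),
])
def test_diagnosis_matches_spec(symptoms, expected):
    assert diagnose(symptoms) == expected
```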

There should be no reason to do all this testing in a "live", meaning a production, system. You do it back behind the scenes in a controlled QA environment, with databases just like you would have in production, and so forth.
 

Certain topics are flagged (Tiananmen or criticism of Xi Jinping in China, for example). But refusing to answer questions about the law will also prevent laymen who are curious about a court order from providing it to the tool and getting an explanation "like I am 5 years old". We'll see in the future (and again, I don't think there will be a global consensus) where the balance lies between functionality and the risk of use by grossly incompetent professionals.

But we don't forbid websites from providing dubious information on health and law.

We don't prohibit websites from providing information.

But if a website offers to draft a reasonable-looking document for you to file on your behalf? That is wrong?

Same thing all around. I wouldn't prohibit a website from providing a person with information about a health topic. But as soon as it gets to personalized diagnostics for that person, it crosses a line. And so on.


That's the line that we see being crossed. If you want to argue, "Hey, they are just providing information, no big deal," fine. But here's the thing- I keep seeing people argue that AI - which really means the corporate entities making it - should be entitled to do whatever they want and reap all the rewards, while all the risks are crammed down onto the users.

I am not comfortable with the business model, ethically or legally.


ETA- another area where this is coming up is defamation. When the AI (the corporate entity) publishes a defamatory statement, where is the liability? Because liability attaches to the publisher (the corporate entity).
 


Well, that heart-lung machine I mentioned earlier - do you figure the only tests ever done on it were directly in surgery? And that when they began tests in surgery, they just sold them to whoever and waited for someone to complain? Of course not! That thing was tested in its various individual parts, then the whole thing was tested on liquids of the same viscosity as blood. Then they put animal blood in it, then human blood - all while attached to machines that monitor the temperature, pressure, oxygenation and so forth. The blood is examined after being run through the machine, to make sure the cells are undamaged.

Putting a human whose life depends on it on the machine is only the final stage, and that is done in carefully controlled trials, with backups if something goes wrong. And the results of all this are reviewed by outside experts, who have no financial stake in the results, before the machine is certified for general use in hospitals.

You had very specific questions, like "What counts as pass/fail", which I cannot answer in a general sense. I do software project management. I can only speak to software QA broadly, and in a highly simplified manner.

For example, you don't make an entire application, and then throw it out to the market untested and wait for users to complain about errors.

Testing software is a multi-faceted, multi-layered thing. There is testing done by developers to make sure smaller sections of code work as expected (often called "unit testing"). There is testing on a larger scale, which checks whether separate parts of an application interact as expected (often called "integration testing", "system integration testing" or "SIT testing"). Then there's testing in which we check that the results the end-user gets are what is expected/desired (usually referred to as "functional testing").

Also, software has multiple types of environments it can exist in. There are development environments in which developers work, which are highly dynamic and change rapidly as engineers make changes to get things to work. There are QA environments in which most SIT and functional testing happens. There are "staging" environments, where software goes (and can again be tested) in a setup typically as much like the environment the public sees as possible, and then finally there's "production", which is where you and I see it, available to the public.

You don't generally test in production. End-users are already getting at it there, and any problems found there are errors that end-users see and are impacted by, thinking horrible things about your company as things go wrong. You always want to find errors before software gets to production.

Testing is not "now start using it randomly and report when something goes wrong". QA professionals are exacting and methodical. They write hundreds, thousands, or tens of thousands of test cases, checking hundreds, thousands, or tens of thousands of individual behaviors of the system. For a big system, if you are serious, those human-written test cases are fed into a system that automates executing the tests, checks whether the result matches what the QA engineer said it should be, and marks the test as failed if it doesn't. That defines a bug, which gets handed back to the developers to fix. Lather-rinse-repeat until the tests all pass.

Where do those test cases come from? People who make software have product managers who define what the software is supposed to do. For, say, an application that's supposed to support a doctor in diagnosing ailments, they'd define what ailments are in the list that the system is supposed to be able to catch, and on what basis it is to suggest a diagnosis. QA will test whether the system gives the right results or the wrong ones.

There should be no reason to do all this testing in a "live", meaning a production, system. You do it back behind the scenes in a controlled QA environment, with databases just like you would have in production, and so forth.

Do you think AI wasn’t tested at all? Or just not to your personal satisfaction?
 

If AI is as unreliable as supposed -- I wouldn't call it demonstrated by a single example that isn't proven to involve AI at all; that's not a very good threshold,
It really all hinges on the first point, it seems to me.
There’s a reason Umbran started this thread.

We know that some of the cases in the original pleadings were AI hallucinations. This thread is just talking about one particular case involving the law, but it's not even the only one mentioned on this board. I've personally brought up other cases in other threads. There are a few other MAJOR cases that have popped up in the law that I personally know of, and Legal Eagle has done at least 3 videos on the subject.

And there have been OTHER threads & comments posted on ENWorld about AI hallucinations in other fields of work.

Total aside. ChatGPT hallucinates almost everything. You cannot rely on it.

No idea why the link keeps breaking. Remove the @s.

http@s://www.face@book.@com/100067163132100/videos/927490359367120
 

Yes. And maybe that's a good thing. Maybe generative AI should not be summarizing information that is associated with major consequences if you get it wrong.

And we need to remember that in this case, not only does the person not have the skills to understand the order, they would also lack the skills to recognize what the AI gets wrong.

I've used AIs to do occasional legal research, and what has repeatedly struck me is not how impressive their abilities are; instead, it's that every single time I've tried it, the AI has gotten something fundamentally wrong, in a way that requires knowledge of that area of the law to realize it wasn't accurately summarizing the information. In some cases, it produces the exact opposite of the correct answer.
 

ETA- another area where this is coming up is defamation. When the AI (the corporate entity) publishes a defamatory statement, where is the liability? Because liability attaches to the publisher (the corporate entity).

Or, in the case of, say, Deepseek, the non-corporate Chinese entity that published for free the AI software that you run on your own computer.
 

Well, that heart-lung machine I mentioned earlier - do you figure the only tests ever done on it were directly in surgery? And that when they began tests in surgery, they just sold them to whoever and waited for someone to complain? Of course not! That thing was tested in its various individual parts, then the whole thing was tested on liquids of the same viscosity as blood. Then they put animal blood in it, then human blood - all while attached to machines that monitor the temperature, pressure, oxygenation and so forth. The blood is examined after being run through the machine, to make sure the cells are undamaged.

Putting a human whose life depends on it on the machine is only the final stage, and that is done in carefully controlled trials, with backups if something goes wrong. And the results of all this are reviewed by outside experts, who have no financial stake in the results, before the machine is certified for general use in hospitals.

You had very specific questions, like "What counts as pass/fail", which I cannot answer in a general sense. I do software project management. I can only speak to software QA broadly, and in a highly simplified manner.

For example, you don't make an entire application, and then throw it out to the market untested and wait for users to complain about errors.

Testing software is a multi-faceted, multi-layered thing. There is testing done by developers to make sure smaller sections of code work as expected (often called "unit testing"). There is testing on a larger scale, which checks whether separate parts of an application interact as expected (often called "integration testing", "system integration testing" or "SIT testing"). Then there's testing in which we check that the results the end-user gets are what is expected/desired (usually referred to as "functional testing").

Also, software has multiple types of environments it can exist in. There are development environments in which developers work, which are highly dynamic and change rapidly as engineers make changes to get things to work. There are QA environments in which most SIT and functional testing happens. There are "staging" environments, where software goes (and can again be tested) in a setup typically as much like the environment the public sees as possible, and then finally there's "production", which is where you and I see it, available to the public.

You don't generally test in production. End-users are already getting at it there, and any problems found there are errors that end-users see and are impacted by, thinking horrible things about your company as things go wrong. You always want to find errors before software gets to production.

Testing is not "now start using it randomly and report when something goes wrong". QA professionals are exacting and methodical. They write hundreds, thousands, or tens of thousands of test cases, checking hundreds, thousands, or tens of thousands of individual behaviors of the system. For a big system, if you are serious, those human-written test cases are fed into a system that automates executing the tests, checks whether the result matches what the QA engineer said it should be, and marks the test as failed if it doesn't. That defines a bug, which gets handed back to the developers to fix. Lather-rinse-repeat until the tests all pass.

Where do those test cases come from? People who make software have product managers who define what the software is supposed to do. For, say, an application that's supposed to support a doctor in diagnosing ailments, they'd define what ailments are in the list that the system is supposed to be able to catch, and on what basis it is to suggest a diagnosis. QA will test whether the system gives the right results or the wrong ones.

There should be no reason to do all this testing in a "live", meaning a production, system. You do it back behind the scenes in a controlled QA environment, with databases just like you would have in production, and so forth.
Narrated by that SDET friend I mentioned previously.

 

Yes. And maybe that's a good thing. Maybe generative AI should not be summarizing information that is associated with major consequences if you get it wrong.

Maybe. Or maybe we should be happy with a result and the accompanying warning to consult a professional and not rely on it for anything important. And maybe we should forbid humans from doing so as well, because bad legal advice abounds all around. What we do for healthcare, though, isn't to forbid it, but to forbid impersonating a doctor.
 

