Okay. So no relevant additions to the discussion. We are at an impasse. My suggestion is that there is no great way to test AI outside of a live environment.
Well, that heart-lung machine I mentioned earlier - do you figure the
only tests ever done on it were directly in surgery? And that when they began tests in surgery, they just sold them to whoever and waited for someone to complain? Of course not! That thing was tested in its various individual parts, then the whole thing was tested on liquids of the same viscosity as blood. Then they put animal blood in it, then human blood - all while attached to machines that monitor the temperature, pressure, oxygenation and so forth. The blood was examined after being run through the machine, to make sure the cells were undamaged.
Putting a human whose life depends on it on the machine is only the final stage, and that is done in carefully controlled trials, with backups if something goes wrong. And the results of all this are reviewed by outside experts, who have no financial stake in the outcome, before the machine is certified for general use in hospitals.
Your suggestion is that there must be a good way but that you cannot comment on what it might be due to lack of expertise. There’s nothing more to say here.
You had very specific questions, like "What counts as pass/fail", which I cannot answer in a general sense. I do software project management. I can only speak to software QA broadly, and in a highly simplified manner.
For example, you don't make an entire application, and then throw it out to the market untested and wait for users to complain about errors.
Testing software is a multi-faceted, multi-layered thing. There is testing done by developers to make sure smaller sections of code work as expected (often called "unit testing"). There is testing on a larger scale, that checks to see if separate parts of an application interact as expected (often called "integration testing", "system integration testing" or "SIT testing"). Then there's testing in which we check that the results the end-user gets are what's expected/desired (usually referred to as "functional testing").
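To make that concrete with something deliberately oversimplified - the function and the rules inside it are invented purely for illustration, not taken from any real application - a unit test just pins down one small behavior and states exactly what the expected result is:

```python
# A toy unit-test sketch. The risk_score function and its rules are
# hypothetical, made up only to show the shape of a unit test.

def risk_score(age, smoker):
    """Crude risk score for an imaginary screening feature."""
    score = 1 if age >= 50 else 0
    if smoker:
        score += 1
    return score

# Unit tests: each one checks a single, specific behavior of that function.
def test_young_nonsmoker_scores_zero():
    assert risk_score(age=30, smoker=False) == 0

def test_older_smoker_scores_two():
    assert risk_score(age=60, smoker=True) == 2
```

A test runner such as pytest collects functions named test_* like these and reports each one as passed or failed; integration and functional tests work the same way in spirit, just exercising bigger slices of the system.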
Also, software has multiple types of environments it can exist in. There are development environments in which developers work, which are highly dynamic and change rapidly as engineers make changes to get things to work. There are QA environments in which most SIT and functional testing happens. There are "staging" environments, where software goes (and can again be tested) in a setup that is typically as much like the environment the public sees as possible. And then finally there's "production", which is where you and I see it, available to the public.
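As a rough illustration of what "the same software, different environment" means in practice - the names, URLs, and the APP_ENV variable here are all invented, and real setups usually drive this from deployment configuration rather than a hard-coded dictionary:

```python
# Hypothetical sketch: one codebase, pointed at different back-ends
# depending on which environment it is deployed to.
import os

ENVIRONMENTS = {
    "dev":        {"db_url": "postgres://localhost/dev_db",    "debug": True},
    "qa":         {"db_url": "postgres://qa-host/qa_db",       "debug": True},
    "staging":    {"db_url": "postgres://stage-host/stage_db", "debug": False},
    "production": {"db_url": "postgres://prod-host/prod_db",   "debug": False},
}

def load_config():
    # Which environment we are in is normally injected from outside,
    # e.g. an APP_ENV variable set by the deployment tooling.
    env = os.environ.get("APP_ENV", "dev")
    return ENVIRONMENTS[env]
```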
You don't generally test in production. End-users are already getting at it there, and any problems found there are errors that end-users see and are impacted by - and they're thinking horrible things about your company as things go wrong. You always want to find errors before software gets to production.
Testing is not "now start using it randomly and report when something goes wrong". QA professionals are exacting and methodical. They write hundreds, thousands, even tens of thousands of test cases, checking hundreds, thousands, even tens of thousands of individual behaviors of the system. For a big system, if you are serious, those human-written test cases are fed into a system that automates executing the tests, checks whether the result matches what the QA engineer said it should be, and marks the test as failed if it doesn't. That defines a bug, which gets handed back to the developers to fix. Lather-rinse-repeat until the tests all pass.
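In spirit, the automated part boils down to something like this heavily simplified loop (system_under_test and the test-case table are stand-ins I made up, not any particular tool):

```python
# Stripped-down sketch of automated test execution: run each
# human-written test case, compare actual against expected, and flag
# mismatches as failures to hand back to the developers.

def system_under_test(x):
    return x * 2  # stand-in for whatever application behavior is being checked

test_cases = [
    {"id": "TC-001", "input": 2,  "expected": 4},
    {"id": "TC-002", "input": 0,  "expected": 0},
    {"id": "TC-003", "input": -3, "expected": -6},
]

failures = []
for case in test_cases:
    actual = system_under_test(case["input"])
    if actual != case["expected"]:
        failures.append((case["id"], case["expected"], actual))

for case_id, expected, actual in failures:
    print(f"{case_id} FAILED: expected {expected}, got {actual}")
print(f"{len(test_cases) - len(failures)}/{len(test_cases)} test cases passed")
```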
Where do those test cases come from? People who make software have product managers who define what the software is supposed to do. For, say, an application that's supposed to support a doctor in diagnosing ailments, they'd define what ailments are in the list the system is supposed to be able to catch, and on what basis it is to suggest a diagnosis. QA will test whether the system gives the right results or the wrong ones.
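For that diagnosis example, the translation from requirement to test case might look like the sketch below - the ailment list, symptoms, and suggest_diagnosis function are entirely invented; in reality those rules would come from the product manager's specification, not from QA:

```python
# Hypothetical requirement-driven test cases for an imaginary
# diagnosis-support feature.

def suggest_diagnosis(symptoms):
    # Stand-in for the application logic being tested.
    if {"fever", "cough"} <= set(symptoms):
        return "flu"
    return "unknown"

# Each case pairs inputs from the spec with the diagnosis the spec says
# the system must suggest for those inputs.
REQUIREMENT_CASES = [
    (["fever", "cough"], "flu"),
    (["headache"], "unknown"),
]

def test_diagnosis_matches_specification():
    for symptoms, expected in REQUIREMENT_CASES:
        assert suggest_diagnosis(symptoms) == expected
```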
There should be no reason to do all this testing in a "live", meaning a production, system. You do it back behind the scenes in a controlled QA environment, with databases just like you would have in production, and so forth.