I've been off the Internet for a week, and I've just read the 20 pages of this thread that I missed.
I can't be bothered to go through all 20 pages to find the post in question, but one post that motivated this reply was on the subject of "people who roll are only doing so to get Mary-Sue". I won't bother to refute this idea, as it should be obvious that it refutes itself, but a specific point was made about the CR system being all messed up if high-stat PCs are used, and the array '16, 16, 14, 12, 12, 12' was derided for being too high for the system to handle without breaking.
Let's take your scenario idea and have two unarmoured human barbarians instead of two dwarf fighters who have already made so much money from adventuring that they can both afford full plate armour at 4th level(!). Would you think that this would be at least as 'fair' a test as your own?
Let's say that the 'so unplayable it's broken' array mentioned above was given to one twin, ending up with: Str 16 Dex 14 Con 16 Int 12 Wis 12 Cha 12.
Now compare this with the other twin, made with point-buy: Str 16 Dex 14 Con 16 Int 8 Wis 10 Cha 8. Is there any difference at all when they do what they are intended to do: solo a hell hound, using your program?
And yet the first is deemed so good it's breaking the system. Point-buy will not let you buy it, but it allows for more concepts.
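The comparison between the two twins can be checked with a few lines of arithmetic. This is only a sketch, assuming the usual d20 rule that an ability modifier is (score - 10) / 2 rounded down; the names `modifier`, `rolled` and `bought` are made up for illustration and are not taken from the program discussed in this thread.

```python
# Sketch only: assumes the usual d20 ability-modifier rule,
# modifier = (score - 10) / 2, rounded down.
ABILITIES = ("Str", "Dex", "Con", "Int", "Wis", "Cha")


def modifier(score: int) -> int:
    """Return the ability modifier for a score (floor division rounds down)."""
    return (score - 10) // 2


# The two twins from the post above (hypothetical variable names).
rolled = dict(zip(ABILITIES, (16, 14, 16, 12, 12, 12)))
bought = dict(zip(ABILITIES, (16, 14, 16, 8, 10, 8)))

for name in ABILITIES:
    r, b = modifier(rolled[name]), modifier(bought[name])
    marker = "same" if r == b else "differs"
    print(f"{name}: rolled {r:+d}, point-buy {b:+d} ({marker})")
```

Under that assumption the Str, Dex and Con modifiers are +3, +2 and +3 for both twins; only Int, Wis and Cha differ.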
So you have a problem with the fact that I gave them both armor? I'm pitting 4th level characters against a hell hound in a cage match. I thought I'd at least give them a fighting chance.

But their AC is identical so it should have no impact on the scenario (assuming they both use heavy armor). Maybe someday I'll allow AC adjustment in my program.
I'm measuring relative power. Relative to each other, the guy with the higher stats is much better off.
That's not a judgement; it just puts some numbers on how much better off they are. BTW I've also run the numbers with some other options (give me a week or two and a road trip to build in a little more flexibility), but the difference comes out at 20-30% more effective in combat.
I'm not sure where the Mary-Sue came from. It was some tangent, similar to the tangent of whether or not 4th level PCs should have plate. Suffice it to say that it came from the discussion about how I don't want to play characters that have above-average stats (like 11 being the lowest number). Well, and something about claiming that you can't build a character to a concept with point buy because if your concept is "I'm the strongest, toughest, fastest, smartest, wisest, prettiest person to ever walk the planet" you can't build it.
As for someone with 16 16 14 12 12 12 vs 16 14 14 10 8 8, I've never said either one breaks the system or is unplayable. I've just stated that with standard 4d6 drop lowest there will be on average a 2 point difference for every stat. Based on my scenario that equates to a 20-30% difference in combat effectiveness.
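For anyone who wants to look at the 4d6-drop-lowest numbers themselves, here is a minimal sketch. It is not the program described in this thread; the function names, the trial count and the seed are all assumptions chosen for illustration. It rolls many six-stat arrays, sorts each from highest to lowest, and averages each position.

```python
import random


def roll_stat(rng: random.Random) -> int:
    """Roll 4d6 and add up the highest three dice."""
    dice = sorted(rng.randint(1, 6) for _ in range(4))
    return sum(dice[1:])  # dice[0] is the lowest die, so drop it


def roll_array(rng: random.Random) -> list[int]:
    """Roll six stats and sort them from highest to lowest."""
    return sorted((roll_stat(rng) for _ in range(6)), reverse=True)


def average_array(trials: int = 20_000, seed: int = 0) -> list[float]:
    """Average each position of the sorted array over many rolled characters."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    totals = [0] * 6
    for _ in range(trials):
        for i, score in enumerate(roll_array(rng)):
            totals[i] += score
    return [t / trials for t in totals]


avg = average_array()
print("average sorted array:", [round(a, 2) for a in avg])
print("average single stat:", round(sum(avg) / 6, 2))
```

The average single stat should land a little above 12, and the averaged array can be set beside any point-buy array to see the gap position by position.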
I think this ties back into the "CR is broken" idea. If people are using systems that provide consistently higher than average points, those characters are going to be significantly more effective at combat on average.
To summarize: I wrote a program to see how much variance you could expect with what some people consider minor differences. While there could be a near infinite variation of generation methods and ways of measuring effectiveness, that does not negate the result I came up with. I chose one scenario that was common but also simple to model. It showed a much more significant power difference than I had expected. That's it.