Yeeeah, but we aren't really talking about a mismatch between expectations and results, we are talking about wildly inaccurate estimations. Like, you can engage in a battle that is literally mathematically unwinnable with your units (even just considering the sheer stats involved, with no special abilities whatsoever) and the autosolve could give you a flawless victory. Or the other way around.
With all due respect, unless you have looked at the code, plugged the available damage numbers into the formula, and run a study showing that the gap between expected and actual results is statistically significant, the results may not be as "wildly inaccurate" or the battles as "mathematically unwinnable" as you think.
And I'm not saying you are necessarily wrong, I just pointed out that those are very strong statements to make without numerical sources to back them up. As I said before, humans are very bad at calculating probabilities. That's why they invented psychometrics (not the paranormal one).
That said, I agree with the assumption that the formula does not take into account the tactics that the AI and the player use in manual combat, so it may be worthwhile to fight manually when the enemy general has powerful abilities that could wipe out your army (like scorching ray, bane, etc.).
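To make the kind of numerical check I mean concrete, here is a minimal Python sketch. It assumes you have already worked out from the stats what win rate you would expect for a given matchup (the hard part, and not something I have done), then autosolved that same battle many times and counted the wins. The function name and the numbers in the example are made up for illustration; this is just an exact two-sided binomial test, not anything taken from the game's code.

```python
from math import comb


def binomial_two_sided_p(wins: int, battles: int, expected_win_rate: float) -> float:
    """Exact two-sided binomial p-value.

    Returns the probability, under the expected win rate, of seeing a win
    count at least as unlikely as the one actually observed.
    """
    def pmf(k: int) -> float:
        # Probability of exactly k wins in `battles` independent fights.
        return (comb(battles, k)
                * expected_win_rate ** k
                * (1 - expected_win_rate) ** (battles - k))

    observed = pmf(wins)
    # Add up every outcome that is no more likely than the observed one.
    # The small tolerance guards against floating-point ties.
    total = sum(pmf(k) for k in range(battles + 1)
                if pmf(k) <= observed * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical numbers: the stats say a coin flip, autosolve won 10 out of 10.
p = binomial_two_sided_p(wins=10, battles=10, expected_win_rate=0.5)
print(p)  # about 0.002
```

A small p-value here would back up the claim that the autosolve disagrees with the stats; a large one would mean the sample is consistent with plain luck. It also treats each autosolve as an independent trial, which is an assumption about how the game rolls its results.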