There are always two error rates.
Defining behavior is great for retrospective analysis, but would you really feel comfortable putting hard cuts into production based on the answers to those questions? I’m genuinely asking, because IME I wouldn’t be.
Estimate what a real human can do in a day, and use that as the initial limit. Verify that the system behaves OK for some time, then scale up the desired trading volume and limits, observe, scale, repeat.
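To make the ramp-up concrete, here is a rough sketch of a daily order cap that starts at human scale and only grows after a run of clean days. Every name and number in it (`RampedLimit`, the starting limit of 200, the doubling factor, the five-day observation window, the ceiling) is a made-up assumption for illustration, not anything a real trading system is known to use.

```python
from dataclasses import dataclass


@dataclass
class RampedLimit:
    """Daily order cap that grows only after a run of incident-free days."""
    limit: int = 200             # assumed: roughly one human trader's daily output
    growth: float = 2.0          # assumed scale-up factor per step
    clean_days_needed: int = 5   # assumed observation window before scaling
    ceiling: int = 100_000       # assumed final desired daily volume
    clean_days: int = 0
    sent_today: int = 0

    def allow(self, n_orders: int = 1) -> bool:
        """Return True and count the orders if they fit under today's cap."""
        if self.sent_today + n_orders > self.limit:
            return False
        self.sent_today += n_orders
        return True

    def end_of_day(self, had_incident: bool) -> None:
        """Reset the daily counter; scale the cap after enough clean days."""
        self.sent_today = 0
        if had_incident:
            self.clean_days = 0  # start the observation window over
            return
        self.clean_days += 1
        if self.clean_days >= self.clean_days_needed:
            self.limit = min(int(self.limit * self.growth), self.ceiling)
            self.clean_days = 0
```

The point of the design is that the cap is never a one-off guess: it only moves after the system has been observed at the current level, and any incident restarts the observation window.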
But you don't do it by making a (bad) guess up front and then just leaving it at that.
Bugs and configuration errors will happen from time to time, and might look silly in retrospect. But the real problem, I think, was that there was no kill switch (which managers and tech leads should have decided to add long ago).
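For what it's worth, the core of a kill switch can be very small. The sketch below is only an assumption about what one might look like: `KillSwitch`, `send_order`, and the `transport` callable are hypothetical names, and a real one would also have to cancel open orders and be reachable from outside the trading process.

```python
import threading


class KillSwitch:
    """Process-wide halt flag that any thread or operator hook can trip."""

    def __init__(self) -> None:
        self._tripped = threading.Event()
        self.reason = None

    def trip(self, reason: str) -> None:
        """Halt trading and record why."""
        self.reason = reason
        self._tripped.set()

    @property
    def tripped(self) -> bool:
        return self._tripped.is_set()


def send_order(order, switch: KillSwitch, transport) -> None:
    """Check the switch before every send; refuse once it is tripped."""
    if switch.tripped:
        raise RuntimeError(f"trading halted: {switch.reason}")
    transport(order)  # hypothetical callable that delivers the order
```

Checking the flag on every send is cheap, and it means a bad deploy or config error can be stopped by one action instead of by finding and fixing the bug while orders keep flowing.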