Counterintuitively, I feel like this will not be super useful, at least for me. My bottleneck is MY ability to parse and understand LLM-generated code. The agent can write code a lot faster than I can read and understand it.
Spend time building a test harness and evaluations that check whether the solution meets the requirements. Then you don't need to read the code line by line, because the tests and evals give you the guarantees and trust you need.
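To make that concrete, here is a rough sketch of what I mean by a harness: write the requirements down as executable checks, then run whatever the agent produced against them. Everything named here (`slugify`, `REQUIREMENTS`, `evaluate`) is made up for illustration, and `slugify` just stands in for some agent-generated function you haven't read.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for agent-generated code you have not reviewed.
def slugify(title: str) -> str:
    cleaned = "".join(ch if ch.isalnum() else " " for ch in title.lower())
    return "-".join(cleaned.split())

Candidate = Callable[[str], str]
# Each requirement is a (description, check) pair; the check takes the
# candidate function and returns True when the requirement is met.
Requirement = Tuple[str, Callable[[Candidate], bool]]

# Assumed requirements, written by a human before looking at any code.
REQUIREMENTS: List[Requirement] = [
    ("lowercases output", lambda f: f("Hello World") == "hello-world"),
    ("strips punctuation", lambda f: f("Hi, there!") == "hi-there"),
    ("collapses whitespace", lambda f: f("  a   b ") == "a-b"),
    ("empty input gives empty slug", lambda f: f("") == ""),
]

def evaluate(candidate: Candidate, requirements: List[Requirement]) -> List[str]:
    """Return the descriptions of every requirement the candidate fails."""
    failures = []
    for description, check in requirements:
        try:
            ok = check(candidate)
        except Exception:
            # A crash counts as a failed requirement, not a harness error.
            ok = False
        if not ok:
            failures.append(description)
    return failures

if __name__ == "__main__":
    failed = evaluate(slugify, REQUIREMENTS)
    print("all requirements met" if not failed else f"failed: {failed}")
```

The trust you get is only as good as the requirements list, so that list is where the review time goes instead of the generated code.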