I like to use txtar files for snapshot testing. One file in the archive holds the expected output and one or more hold the input(s). Most mainstream languages already have txtar parsers, so this approach makes it trivial to port an entire test suite across languages.
See https://docs.rs/txtar/latest/txtar/ for an example of such a parser.
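To make the idea concrete, here is a rough sketch in Python. It hand-rolls a minimal txtar parser instead of using any particular library, so `parse_txtar`, `check_snapshot`, `transform`, and the file names `input.txt` / `expected.txt` are all my own placeholders, not the API of the crate linked above. It only relies on the basic format: free-form comment text first, then `-- name --` marker lines, each followed by that file's contents.

```python
import io


def parse_txtar(text):
    """Split a txtar archive into (comment, {file name: contents}).

    Minimal sketch: a line of the form "-- name --" starts a new file,
    and everything before the first marker is the archive comment.
    """
    comment = []
    files = {}
    current = None  # name of the file currently being collected
    for line in io.StringIO(text):
        stripped = line.rstrip("\r\n")
        if stripped.startswith("-- ") and stripped.endswith(" --") and len(stripped) >= 6:
            name = stripped[3:-3].strip()
            if name:  # a marker with an empty name is treated as plain text
                current = name
                files[current] = []
                continue
        if current is None:
            comment.append(line)
        else:
            files[current].append(line)
    return "".join(comment), {name: "".join(body) for name, body in files.items()}


def check_snapshot(archive_text, fn):
    """Run fn on the input file and compare against the expected file."""
    _, files = parse_txtar(archive_text)
    return fn(files["input.txt"]) == files["expected.txt"]


# Hypothetical function under test: uppercases its input.
def transform(source):
    return source.upper()


ARCHIVE = """\
Snapshot case: uppercase the input.
-- input.txt --
hello
world
-- expected.txt --
HELLO
WORLD
"""

if __name__ == "__main__":
    print(check_snapshot(ARCHIVE, transform))
```

Since each test case is just one text file, porting the suite to another language mostly means rewriting the small `check_snapshot` part; the archives themselves stay untouched.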
basically, in my testing i really felt that gpt5 was "using tools to think" rather than just "using tools". it gets very powerful on long-horizon coding tasks (a separate post i'm publishing later).
to give one substantive example: in my developer beta (they will release the video in a bit) i put it on a task that claude code had been stuck on for the last week, using the same prompts. it added logging to instrument some of the failures we were seeing, asked me to rerun, and figured out the fix from the logs it had added.
Sorry, but this sounds like overly sensational marketing speak and just leaves a bad taste in the mouth for me.