LLM evaluations are highly sensitive to the details of prompt structure. This post shows how structured generation reduces variance in results and shifts in model rankings.
My concern with grammar-based sampling is that it makes the model dumber: after all, you are forcing it to say something other than what it thought would be best.
Intuitively, regex or JSON grammars have a much lower "semantic dimension" than what today's LLMs allow. Maybe the observed performance gains result from that lower dimensionality.
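For context on the mechanism being debated: grammar-constrained sampling typically works by masking out, at each decoding step, every token that would make the output invalid under the grammar, then picking among the survivors. Here is a toy sketch of that idea; the vocabulary, the pattern, and the `model_scores` stand-in are all made up for illustration and bear no relation to any real tokenizer or library:

```python
import re

# Toy vocabulary and target grammar (illustrative only): the model must
# emit exactly {"answer": "yes"} or {"answer": "no"}.
VOCAB = ['{', '}', '"answer"', ':', '"yes"', '"no"', ' ']
PATTERN = re.compile(r'\{"answer": "(yes|no)"\}')
TARGETS = ['{"answer": "yes"}', '{"answer": "no"}']


def model_scores(prefix):
    """Stand-in for LLM logits: this toy 'model' mildly prefers "yes"."""
    return {tok: (2.0 if tok == '"yes"' else 1.0) for tok in VOCAB}


def allowed(prefix, tok):
    """A token is allowed if prefix+tok can still extend to a valid string.

    Real implementations track this incrementally with a regex/grammar
    automaton; enumerating full targets here keeps the sketch simple.
    """
    candidate = prefix + tok
    return any(t.startswith(candidate) for t in TARGETS)


def constrained_decode():
    """Greedy decoding with grammar masking applied before each pick."""
    out = ''
    while not PATTERN.fullmatch(out):
        scores = model_scores(out)
        valid = [t for t in VOCAB if allowed(out, t)]
        out += max(valid, key=lambda t: scores[t])
    return out


print(constrained_decode())  # → {"answer": "yes"}
```

The point of the sketch is that the mask never changes the model's scores; it only removes grammar-breaking continuations, which is why the "forcing it to say something else" worry applies only when the highest-scoring token would have violated the grammar.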
That whole structured generation line of work looks promising. I hope someone else takes this and runs evaluations on other benchmarks. Curious to see if the results translate!