My biggest gripe is that he's comparing probabilistic models (LLMs) using a single sample from each.
You wouldn't compare different random number generators by taking one sample from each and then concluding that generator 5 generates the highest numbers...
It would be better to run the comparison with 10 images (or more) for each LLM and then average the results.
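To make the point concrete, here is a rough simulation sketch. Everything in it is made up for illustration: the model names, the "true" mean scores, and the noise level are assumptions, and a Gaussian score is only a stand-in for however you would actually rate an image. It estimates how often the genuinely best model comes out on top when you rank by 1 sample versus the mean of 10.

```python
import random
import statistics

# Hypothetical models and their "true" mean quality scores (invented numbers).
TRUE_MEANS = {"model_a": 5.0, "model_b": 5.5, "model_c": 6.0}
NOISE_SD = 2.0  # assumed spread of a single sample's score


def sample_score(name, rng):
    """Draw one noisy score for a model (stand-in for rating one generated image)."""
    return rng.gauss(TRUE_MEANS[name], NOISE_SD)


def pick_winner(n_samples, rng):
    """Rank models by the mean of n_samples scores each and return the top one."""
    means = {
        name: statistics.mean(sample_score(name, rng) for _ in range(n_samples))
        for name in TRUE_MEANS
    }
    return max(means, key=means.get)


def best_model_win_rate(n_samples, trials=2000, seed=0):
    """Fraction of repeated comparisons in which the truly best model wins."""
    rng = random.Random(seed)
    best = max(TRUE_MEANS, key=TRUE_MEANS.get)
    wins = sum(pick_winner(n_samples, rng) == best for _ in range(trials))
    return wins / trials


if __name__ == "__main__":
    for n in (1, 10):
        rate = best_model_win_rate(n)
        print(f"{n:>2} sample(s) per model: best model wins {rate:.0%} of the time")
```

With these assumed numbers, the single-sample comparison picks the wrong winner far more often than the 10-sample average does. The exact rates depend entirely on how close the models are and how noisy the scores are.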
https://help.ente.io/self-hosting/