curioussquirrel · 4 months ago
Interesting experiment, but I'd say aggregating the scores across models is far from ideal. Gemini 1.5 Flash got close-to-perfect scores on most languages (the remaining gap probably boils down to small variances in temp/top_k and statistical error). Small models are generally quite bad at non-English languages, so they tank the aggregated score and hide how well the stronger models actually do.
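To make the aggregation point concrete, here's a rough sketch. The model names and all the numbers are made up purely for illustration (they are not results from the experiment), and the helper names are mine:

```python
from statistics import mean

# HYPOTHETICAL scores (fraction correct), invented for illustration only.
# They are not taken from the experiment being discussed.
scores = {
    "strong_model": {"en": 0.99, "de": 0.98, "ja": 0.97},
    "small_model": {"en": 0.90, "de": 0.55, "ja": 0.40},
}


def per_language_aggregate(scores):
    """Average each language's score across all models (the pooled view)."""
    languages = sorted({lang for by_lang in scores.values() for lang in by_lang})
    return {
        lang: mean(by_lang[lang] for by_lang in scores.values() if lang in by_lang)
        for lang in languages
    }


def per_model_breakdown(scores):
    """Average each model's score across languages (the per-model view)."""
    return {model: mean(by_lang.values()) for model, by_lang in scores.items()}


aggregate = per_language_aggregate(scores)
breakdown = per_model_breakdown(scores)

for lang, value in aggregate.items():
    print(f"{lang}: pooled across models = {value:.3f}")
for model, value in breakdown.items():
    print(f"{model}: mean across languages = {value:.3f}")
```

With numbers like these, the pooled score for a non-English language sits well below what the strong model gets on its own, which is why I'd rather see the results reported per model.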

BTW, newer generations of models seem to have made some real progress in multilingual performance.