Good catch: the calculators here are bizarre. For GPT-4o, a 512x512 image uses 170 tile tokens. For GPT-4o mini, the same 512x512 image uses 5,667 tile tokens. How does that even work in the context of a ViT? The patch size and image-encoder output should be the same for both models.
Since the base token counts increase proportionally (which makes even less sense), I have a hunch this is a JavaScript bug in the calculator rather than a real difference in the encoder.
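For reference, OpenAI's vision docs describe high-detail image cost as a fixed base plus a per-tile cost. A minimal sketch of that formula: the gpt-4o constants (85 base, 170 per 512x512 tile) are documented; the mini constants in the comment are just what the calculator reports, not independently verified.

    // Sketch of the documented high-detail image token formula.
    // gpt-4o: 85 base + 170 per 512x512 tile. The calculator's
    // gpt-4o-mini per-tile figure (5,667) is ~33.3x that, which is
    // why the proportional base count above is suspicious.
    function imageTokens(width: number, height: number, base = 85, perTile = 170): number {
      // Scale to fit within 2048x2048, then scale the short side down to 768px.
      const fit = Math.min(1, 2048 / Math.max(width, height));
      const short = Math.min(1, 768 / (Math.min(width, height) * fit));
      const w = Math.floor(width * fit * short);
      const h = Math.floor(height * fit * short);
      // Count 512x512 tiles over the resized image.
      const tiles = Math.ceil(w / 512) * Math.ceil(h / 512);
      return base + tiles * perTile;
    }

    // A 512x512 image is a single tile: 85 + 170 = 255 tokens on gpt-4o.
    console.log(imageTokens(512, 512));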
Confirmed that mini uses ~30x more prompt tokens than base gpt-4o with the same image and the same prompt: gpt-4o-mini reports { completionTokens: 46, promptTokens: 14207, totalTokens: 14253 } vs. gpt-4o's { completionTokens: 82, promptTokens: 465, totalTokens: 547 }.
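The camelCase usage object above looks like it came from a wrapper library; here's a minimal sketch of the same comparison using the plain OpenAI Node SDK (which reports snake_case usage — the image URL is a placeholder):

    // Compare prompt token counts for the same image + prompt across models.
    import OpenAI from "openai";

    const client = new OpenAI();

    async function promptTokens(model: string, imageUrl: string): Promise<number> {
      const res = await client.chat.completions.create({
        model,
        messages: [{
          role: "user",
          content: [
            { type: "text", text: "Describe this image." },
            { type: "image_url", image_url: { url: imageUrl } },
          ],
        }],
      });
      return res.usage?.prompt_tokens ?? 0;
    }

    const url = "https://example.com/test.png"; // placeholder image
    const mini = await promptTokens("gpt-4o-mini", url);
    const base = await promptTokens("gpt-4o", url);
    console.log({ mini, base, ratio: mini / base }); // ~30x per the counts above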
What is up with that eval@32? Am I reading it correctly that they generate 32 responses and take a majority vote? Who would use the API like that? It feels like such a fake way to improve metrics.
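For anyone unfamiliar, "maj@32" is majority voting (self-consistency): sample N completions and score only the most common final answer. A minimal sketch, with extractAnswer as a hypothetical stand-in for however the eval actually parses answers:

    // maj@N sketch: request n samples in one call, tally the parsed
    // answers, and return whichever answer appears most often.
    import OpenAI from "openai";

    const client = new OpenAI();

    // Hypothetical answer parser: here, just the last line of the reply.
    function extractAnswer(text: string): string {
      const lines = text.trim().split("\n");
      return lines[lines.length - 1].trim();
    }

    async function majorityVote(prompt: string, n = 32): Promise<string> {
      const res = await client.chat.completions.create({
        model: "gpt-4o-mini",
        messages: [{ role: "user", content: prompt }],
        n, // the API's built-in way to get multiple samples per call
        temperature: 0.7, // diversity across samples is what makes voting help
      });
      const tally = new Map<string, number>();
      for (const choice of res.choices) {
        const a = extractAnswer(choice.message.content ?? "");
        tally.set(a, (tally.get(a) ?? 0) + 1);
      }
      return [...tally.entries()].sort((x, y) => y[1] - x[1])[0][0];
    }

It does improve benchmark numbers, but at 32x the cost per question — which is the point of the complaint: almost nobody calls the API that way in production.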
Page 7 of their technical report [0] has a better apples-to-apples comparison. Why they chose to show apples-to-oranges numbers on their landing page is odd to me.
We've been using Braintrust for evals at Zapier and it's been really great -- pumped to try out this proxy (which should be able to replace some custom code we've written internally for the same purpose!).
This, if anything, should be a huge red flag.