hamiltont · 21 days ago
Anecdotal tip on LLM-as-judge scoring: skip the 1-10 scale, use boolean criteria instead, then weight manually, e.g.:

- Did it cite the 30-day return policy? Y/N
- Tone professional and empathetic? Y/N
- Offered clear next steps? Y/N

Then: 0.5 * accuracy + 0.3 * tone + 0.2 * next_steps

Why: This reduces the volatility of the judge's responses while still keeping the creativity (temperature) needed for good intuition.
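
A minimal sketch of the scoring step in Python, assuming the judge LLM is prompted to return the three booleans as JSON (criterion names and weights are just the example above, nothing canonical):

  # Weighted score from boolean judge verdicts; the judge call itself is out of scope here.
  WEIGHTS = {"accuracy": 0.5, "tone": 0.3, "next_steps": 0.2}

  def score(verdict: dict) -> float:
      """verdict is the judge's parsed JSON, e.g. {"accuracy": true, "tone": true, "next_steps": false}."""
      return sum(w * float(bool(verdict.get(name, False))) for name, w in WEIGHTS.items())

  # score({"accuracy": True, "tone": True, "next_steps": False}) -> 0.8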

pocketarc · 21 days ago
I use this approach for a ticket-based customer support agent. There are a bunch of boolean checks that the LLM must pass before its response is allowed through. Some are hard fails; others, like you brought up, are just a weighted ding to the response's final score.

Failures are fed back to the LLM so it can regenerate, taking that feedback into account. People are much happier with it than I could have imagined, though it's definitely not cheap (but the extra cost is well worth the tradeoff).
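
Roughly, the loop looks like this (a simplified sketch; check names and thresholds are made up, and generate/judge stand in for the model call and the boolean checks):

  # Sketch of the check-and-regenerate loop.
  HARD_CHECKS = ["no_pii_leak", "cites_correct_policy"]   # any failure blocks the reply
  SOFT_CHECKS = {"tone": 0.3, "next_steps": 0.2}          # failures only lower the score

  def respond(ticket, generate, judge, max_attempts=3, min_score=0.7):
      feedback = []
      for _ in range(max_attempts):
          draft = generate(ticket, feedback)              # regenerate using prior failures
          results = judge(ticket, draft)                  # dict: check name -> bool
          failed = [c for c in list(HARD_CHECKS) + list(SOFT_CHECKS) if not results.get(c, False)]
          score = 1.0 - sum(SOFT_CHECKS[c] for c in failed if c in SOFT_CHECKS)
          if not any(c in HARD_CHECKS for c in failed) and score >= min_score:
              return draft                                # passed: let the response through
          feedback = failed                               # feed failures into the next attempt
      return None                                         # still failing: escalate to a human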

tomjakubowski · 21 days ago
Funny, this move is exactly what YouTube did to their system of human-as-judge video scoring, which was a 1-5 scale before they made it thumbs up/thumbs down in 2010.
jorvi · 21 days ago
I hate thumbs up/down. Two values are too few. I understand that five was maybe too many, but thumbs up/down systems need an explicit third "eh, it's okay" value for things I don't hate and don't want to save to my library, but that I'd still like the system to know I have an opinion on.

I know that consuming something and not thumbing it up/down sort of does that, but it's a vague enough signal (it could also mean "not close enough to the keyboard/remote to thumb it up/down") that recommendation systems can't count it as an explicit choice.

piskov · 21 days ago
How come accuracy has only 50% weight?

“You’re absolutely right! Nice catch how I absolutely fooled you”

lorey · 21 days ago
Yes, absolutely. This aligns with what we found. It seems to be necessary to be very clear on scoring (at least for Opus 4.5).
Imustaskforhelp · 21 days ago
This actually seems like really good advice. I'm interested in how you might adapt this to things like programming-language benchmarks.

By having independent tests, checking whether the output passes each one (yes or no), and then weighting some of them (the more complicated tasks) more heavily than others? Or how exactly would you go about it?
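
Roughly, I'm picturing something like this (just a sketch; the test files and weights are made up):

  # Sketch: weighted pass/fail scoring for generated code.
  import subprocess

  TASKS = [                                    # (test command, weight): harder tasks weigh more
      ("pytest tests/test_fizzbuzz.py -q", 1.0),
      ("pytest tests/test_parser.py -q", 2.0),
      ("pytest tests/test_scheduler.py -q", 3.0),
  ]

  def benchmark_score() -> float:
      earned = sum(w for cmd, w in TASKS
                   if subprocess.run(cmd.split(), capture_output=True).returncode == 0)
      return earned / sum(w for _, w in TASKS)  # 0.0 (all fail) to 1.0 (all pass)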

hamiltont · 21 days ago
Not sure I'm fully following your question, but maybe this helps:

IME deep thinking has moved from upfront architecture to post-prototype analysis.

Pre-LLM: Think hard → design carefully → write deterministic code → minor debugging

With LLMs: Prototype fast → evaluate failures → think hard about prompts/task decomposition → iterate

When your system logic is probabilistic, you can't fully architect in advance—you need empirical feedback. So I spend most time analyzing failure cases: "this prompt generated X which failed because Y, how do I clarify requirements?" Often I use an LLM to help debug the LLM.

The shift: from "design away problems" to "evaluate into solutions."

46493168 · 21 days ago
Isn’t this just rubrics?
8note · 21 days ago
It's a weighted decision matrix.
andy99 · 21 days ago
Depends on what you’re doing. Using the smaller / cheaper LLMs will generally make it way more fragile. The article appears to focus on creating a benchmark dataset with real examples. For lots of applications, especially if you’re worried about people messing with it, about weird behavior on edge cases, about stability, you’d have to do a bunch of robustness testing as well, and bigger models will be better.

Another big problem is that it's hard to set objectives in many cases; for example, maybe your customer service chat still passes but comes across worse with a smaller model.

I'd be careful is all.

candiddevmike · 21 days ago
One point in favor of smaller/self-hosted LLMs: more consistent performance, and you control your upgrade cadence, not the model providers.

I'd push everyone to self-host models (even if it's on a shared compute arrangement), as no enterprise I've worked with is prepared for the churn of keeping up with the hosted model release/deprecation cadence.

blharr · 21 days ago
Where can I find success stories about self-hosting models? All of it seems like throwing tens of thousands away on compute only for it to work worse than the standard providers. The self-hosted models seem to get out of date too, or there end up being good reasons (improved performance) to replace them.
andy99 · 21 days ago
How much you value control is one part of the optimization problem. Obviously self-hosting gives you more, but it costs more. And re: evals, I trust GPT, Gemini, and Claude a lot more than some smaller thing I self-host, and I would end up wanting to do way more evals if I self-hosted a smaller model.

(Potentially interesting aside: I’d say I trust new GLM models similarly to the big 3, but they’re too big for most people to self host)

jmathai · 21 days ago
You may also be getting a worse result for higher cost.

For a medical use case, we tested multiple Anthropic and OpenAI models as well as MedGemma. Pleasantly surprised when the LLM-as-judge scored gpt5-mini as the clear winner. I don't think I would have considered using it for these specific use cases, since I assumed higher reasoning was necessary.

Still waiting on human evaluation to confirm the LLM Judge was correct.

lorey · 21 days ago
That's interesting. Similarly, we found out that for very simple tasks the older Haiku models are interesting as they're cheaper than the latest Haiku models and often perform equally well.
andy99 · 21 days ago
You obviously know what you're looking for better than I do, but personally I'd want to see a narrative that makes sense before accepting that a smaller model somehow just performs better, even if the benchmarks say so. There may be such an explanation; without one it feels very dicey.
lorey · 21 days ago
You're right. We did a few use cases, and I have to admit that while customer service is the easiest to explain, it's also where I'd not choose the cheapest model, for said reasons.
verdverm · 21 days ago
I'd second this wholeheartedly

Since building a custom agent setup to replace copilot, adopting/adjusting Claude Code prompts, and giving it basic tools, gemini-3-flash is my go-to model unless I know it's a big and involved task. The model is really good at 1/10 the cost of pro, super fast by comparison, and some basic A/B testing shows little to no difference in output on the majority of tasks I tried.

Cut all my subs, spend less money, don't get rate limited

dpoloncsak · 21 days ago
Yeah, on one of my first projects a buddy of mine asked, "Why aren't you using [ChatGPT 4.0] nano? It's 99% of the effectiveness at 10% of the price."

I've been using the smaller models ever since. Nano/mini, flash, etc.

sixtyj · 21 days ago
Yup.

I found out recently that Grok-4.1-fast has similar pricing (in cents) but a 10x larger context window (2M tokens instead of the ~128-200k of gpt-4-1-nano), and a ~4% hallucination rate, the lowest in blind tests on LLM Arena.

phainopepla2 · 21 days ago
I have been benchmarking many of my use cases, and the GPT Nano models have fallen completely flat on every single one except for very short summaries. I would call them 25% effectiveness at best.
walthamstow · 21 days ago
Flash Lite 2.5 is an unbelievably good model for the price
r_lee · 21 days ago
Plus I've found that with "thinking" models, the thinking acts more like memory than an actual performance boost. It might even be worse, because if it goes even slightly wrong in the "thinking" part, it'll then commit to that in the actual response.
verdverm · 21 days ago
For sure, the difference in the most recent model generations makes them far more useful for many daily tasks. This is the first generation with thinking as a significant mid-training focus, and it shows.

gemini-3-flash stands well above gemini-2.5-pro

PunchyHamster · 20 days ago
The LLM bubble will burst the second investors figure out how much a well-managed local model can do.
verdverm · 20 days ago
Except that

1. There is still night and day difference

2. Local is slow af

3. The vast majority of people will not run their own models

4. I would have to spend more than $200 a month on frontier AI to come close to what a decent at-home AI rig would cost. Why would I not use frontier models at this point?

PeterStuer · 20 days ago
All true, but from what I see in the field it is most often "ain't nobody got time for that" as teams rush into adoption, costs be damned for now. We'll deal with it only if cost becomes a major issue.
lorey · 20 days ago
Haha, very true. Exactly as described in the article.
gridspy · 21 days ago
Wow, this was some slick long form sales work. I hope your SaaS goes well. Nice one!
iFire · 21 days ago
I love the user experience for your product. You're giving a free demo with results within 5 minutes and then encourage the customer to "sign in" for more than 10 prompts.

Presumably that'll be some sort of funnel for a paid upload of prompts.

gforce_de · 19 days ago
Wow - interesting how strong the differences are!

What seems to be missing: I cannot see the answers from the different models; one has to rely on the "correctness" score.

Another minor thing: the scoring seems hardcoded to 50% correctness, 30% cost, 20% latency, which is OK, but in my case I care more about correctness and don't care about latency at all.

Wow! This was my test prompt:

  You are an expert linguist and translator engine.  
  Task: Translate the input text from English into the languages listed below.  
  Output Format: Return ONLY a valid, raw JSON object.  
  Do not use Markdown formatting (no ```json code blocks).  
  Do not add any conversational text.
  
  Keys: Use the specified ISO 639-1 codes as keys.
  
  Target Languages and Codes:  
  - English: "en" (Keep original or refine slightly)  
  - Mandarin Chinese (Simplified): "zh"  
  - Hindi: "hi"  
  - Spanish: "es"  
  - French: "fr"  
  - Arabic: "ar"  
  - Bengali: "bn"  
  - Portuguese: "pt"  
  - Russian: "ru"  
  - German: "de"  
  - Urdu: "ur"
  
  Input text to translate:  
  "A smiling boy holds a cup as three colorful lorikeets perch on his arms and shoulder in an outdoor aviary."

iFire · 21 days ago
https://evalry.com/question-benchmarks/game-engine-assistant...

Here's a bug report: switching the model group makes the API hang in private mode.

iFire · 21 days ago
Heads up, I think I broke the site.
lorey · 21 days ago
Thanks. Will take a look.
Havoc · 21 days ago
I'm also collecting the data on my side with the hope of later using it to fine-tune a tiny model. Unsure whether it'll work, but if I'm using APIs anyway I may as well gather it and try to bottle some of that magic of the bigger models.
xmcqdpt2 · 20 days ago
This is useful when selecting a model for an initial application. The main issue I'm concerned about, though, is ongoing testing. At work we have devs slinging prompt changes left and right into prod after "it works on my machine" local testing. It's as if saying the word "AI" were enough to get rid of all engineering knowledge.

Where is TDD for prompt engineering? Does it exist already?
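
A bare-bones version could be ordinary pytest tests over a frozen set of real examples, run in CI on every prompt change. A sketch, where call_model and eval_cases.json are hypothetical stand-ins for your own client and data:

  # Sketch of prompt regression tests.
  import json
  import pytest

  SUPPORT_PROMPT = "You are a support agent. Cite the 30-day return policy when relevant."

  def call_model(prompt: str, user_input: str) -> str:
      raise NotImplementedError("wire this to whatever client wraps your provider")

  with open("eval_cases.json") as f:            # frozen real examples, committed to the repo
      CASES = json.load(f)                      # [{"input": "...", "must_mention": "..."}, ...]

  @pytest.mark.parametrize("case", CASES)
  def test_prompt_change_does_not_regress(case):
      reply = call_model(SUPPORT_PROMPT, case["input"])
      # cheap deterministic check first; boolean LLM-as-judge checks can be layered on top
      assert case["must_mention"].lower() in reply.lower()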

lorey · 20 days ago
This is a very good point. When I came in, the founder had done a lot of evaluation based on a few prompts with manual scoring, exactly as described. Showing the results helped me underline that "works for me" (tm) often does not match the actual data.
cap11235 · 20 days ago
Evals have always existed, and not using them when building systems is relying on superstition.
lorey · 20 days ago
This is true with one caveat.

In most cases, e.g. with regular ML, evals are easy and not doing them results in inferior performance. With LLMs, especially frontier LLMs, this has flipped: not doing them will likely still give you alright performance, while at the same time proper benchmarks are tricky to implement.