Going from free text to a tool call with parameters is, in the grand scheme of things, one of the easier problems, especially when you only have a limited number of tools.
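To make that concrete, here's a toy sketch of what "free text to tool call" validation can look like with a small tool set (the tool names and schemas here are made up, not any real API):

```python
import json

# Hypothetical tool registry -- names and parameter schemas are illustrative
TOOLS = {
    "get_weather": {"city": str},
    "search_docs": {"query": str, "limit": int},
}

def parse_tool_call(raw: str):
    """Validate a model's JSON tool call against the registry.

    With only a handful of tools, a strict check like this catches the
    common failure modes: unknown tool, missing or mistyped params.
    """
    call = json.loads(raw)
    schema = TOOLS[call["tool"]]  # KeyError means the model invented a tool
    for name, typ in schema.items():
        if name not in call["args"]:
            raise ValueError(f"missing param: {name}")
        if not isinstance(call["args"][name], typ):
            raise TypeError(f"{name} should be {typ.__name__}")
    return call["tool"], call["args"]

tool, args = parse_tool_call('{"tool": "get_weather", "args": {"city": "Berlin"}}')
```

With few tools the whole problem reduces to schema validation like this, which is why it's one of the easier cases to eval.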
Seems like it works pretty well. Our prompts and params get tweaked towards better and better results, and we get a sense of what’s worth paying more for.
In some cases I've seen teams rely on a mix of automated metrics and human review, especially for production systems where reliability matters a lot.
But evaluation pipelines for AI still seem much less standardized compared to traditional software monitoring.
It wouldn’t be too difficult to build something like that for your own usage, though I found it pretty easy to get datasets set up as things stand.
Essentially a game changer for understanding whether your prompts are working, especially if you’re doing something that requires high consistency.
In our case we’d use an LLM for classification, which fits perfectly with evals.
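For the classification case, an eval can be as simple as accuracy over a labeled set; `classify()` below is just a stand-in for the actual LLM call, and the labels are invented:

```python
# Tiny labeled dataset -- examples and categories are illustrative
LABELED = [
    ("refund my order", "billing"),
    ("app crashes on login", "bug"),
    ("love the new update", "praise"),
]

def classify(text: str) -> str:
    # Placeholder for the real LLM call; a real setup would prompt a model
    if "refund" in text:
        return "billing"
    if "crashes" in text:
        return "bug"
    return "praise"

def accuracy(dataset, fn) -> float:
    """Fraction of examples where the classifier's label matches the gold label."""
    hits = sum(fn(text) == label for text, label in dataset)
    return hits / len(dataset)

score = accuracy(LABELED, classify)
```

Classification is the friendliest eval target because the output space is closed: every prediction is simply right or wrong.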
I’d like a single number I could optimize the pipeline against, but I find it hard to figure out what that number should be measuring.
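One option I've seen is to blend sub-metrics into a weighted score; the metrics and weights below are purely illustrative, and choosing them is exactly the hard part being described:

```python
def pipeline_score(metrics: dict, weights: dict) -> float:
    """Collapse several sub-metrics into one number via a weighted sum.

    The weights encode what you care about; the math is trivial,
    picking the weights is the actual judgment call.
    """
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights should sum to 1"
    return sum(metrics[k] * w for k, w in weights.items())

score = pipeline_score(
    {"accuracy": 0.92, "format_valid": 0.99, "latency_ok": 0.80},
    {"accuracy": 0.6, "format_valid": 0.3, "latency_ok": 0.1},
)
```

The catch is that a blended score hides regressions in any single component, which is probably why no obvious "one number" exists.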
- AI to evaluate itself (e.g. ask Claude to test out its own skill)
- custom-built platform (I see interest in this space)
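For the "AI evaluates itself" option, the skeleton is just a judge prompt plus verdict parsing; `judge()` here is a stub standing in for the second model call, and the prompt wording is made up:

```python
# Illustrative judge prompt -- a real one would be tuned for the task
JUDGE_TEMPLATE = (
    "You are grading an answer.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Reply with exactly PASS or FAIL."
)

def judge(prompt: str) -> str:
    # Stub: a real setup would send this prompt to a second model
    return "PASS" if "Paris" in prompt else "FAIL"

def grade(question: str, answer: str) -> bool:
    """Build the judge prompt and parse the verdict into a boolean."""
    verdict = judge(JUDGE_TEMPLATE.format(question=question, answer=answer))
    return verdict.strip() == "PASS"

ok = grade("Capital of France?", "Paris")
```

The appeal is zero extra infrastructure; the risk is that the judge shares the blind spots of the model being judged.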
I've actually been thinking about this problem a lot and am working on building a custom eval runner for your codebase. What would your use case be for this?
I like to play with knowledge-base-powered chatbots, but what's most useful to me (and probably my primary use case) is coding agents, since I use CC every day. I recently heard about Minimax m2.5, which is apparently a pretty good coding agent (they say it's comparable to opus 4.6), but I haven't tried it yet, and it'd take a lot of time to figure out whether it's actually better.
Definitely not happy with it, but everything is moving too fast to feel like it's worth investing in.
I find that for every hypothesis I might have to run a thousand prompts to collect enough data for a conclusion. For instance, discovering how reliably different models can extract noun phrases from a text took hours of grinding, and even that was for a small text; I haven’t yet run the process on a large one.
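The grind looks roughly like this in code: run the same extraction many times and count exact matches. `extract()` is a stand-in for the model call, with a flake rate simulating how often it misses a phrase (all names and numbers here are invented for illustration):

```python
import random

# Gold noun phrases for one small text -- illustrative
EXPECTED = {"the quick fox", "the lazy dog"}

def extract(text: str, flake_rate: float, rng: random.Random) -> set:
    # Stand-in for the LLM call; with probability flake_rate it drops a phrase
    if rng.random() > flake_rate:
        return set(EXPECTED)
    return {"the quick fox"}

def reliability(n_runs: int, flake_rate: float, seed: int = 0) -> float:
    """Fraction of runs where the extraction exactly matches the gold set."""
    rng = random.Random(seed)
    hits = sum(extract("...", flake_rate, rng) == EXPECTED
               for _ in range(n_runs))
    return hits / n_runs

r = reliability(1000, flake_rate=0.2)
```

With a real model you'd swap the stub for an API call, which is where the hours go: a thousand runs at even a second each is fifteen-plus minutes per hypothesis per model.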