yelmahallawy commented on Ask HN: How are people doing AI evals these days?    · Posted by u/yelmahallawy
kelseyfrog · a day ago
20 mins or so. The bottleneck is rate-limiting. It's amenable to parallelization: each test can run in isolation at the same time.
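The setup described here (independent tests run in parallel, capped by a rate limit) could be sketched roughly like this. `run_test`, the concurrency cap of 5, and the pass criterion are all hypothetical stand-ins, not anything from the thread:

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Semaphore

# Hypothetical stand-in for one isolated eval case; a real version
# would call the model under test and check its answer.
def run_test(case):
    return {"case": case, "passed": case % 2 == 0}

# Cap concurrent calls to stay under the provider's rate limit.
rate_limit = Semaphore(5)

def run_with_limit(case):
    with rate_limit:
        return run_test(case)

def run_suite(cases):
    # Each test is independent, so they can all run at the same time.
    with ThreadPoolExecutor(max_workers=16) as pool:
        return list(pool.map(run_with_limit, cases))

results = run_suite(range(10))
print(sum(r["passed"] for r in results))  # → 5
```

The semaphore is what makes rate-limiting the bottleneck rather than a hard failure: extra workers just block until a slot frees up.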
yelmahallawy · 15 hours ago
Gotchu. Yeah that's pretty quick, awesome thanks!
raw_anon_1111 · a day ago
No. I also use the least sophisticated but fastest model that Amazon hosts - and it hosts all of them except OpenAI models - Nova Lite

Going from free text to tool call with parameters in the grand scheme of things is one of the easiest things to do especially when you only have a limited number of tools.
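Scoring that free-text-to-tool-call step is correspondingly simple: check the tool name exactly and the parameters field by field. This is a minimal sketch; the `get_weather` tool, its parameters, and the canned model output are hypothetical (in practice the output would come from the model, e.g. Nova Lite):

```python
# Minimal sketch of scoring a free-text -> tool-call conversion.
def score_tool_call(output, expected):
    """Exact match on tool name, per-field match on parameters."""
    if output.get("tool") != expected["tool"]:
        return 0.0
    params, want = output.get("params", {}), expected["params"]
    if not want:
        return 1.0
    hits = sum(params.get(k) == v for k, v in want.items())
    return hits / len(want)

# Hypothetical example: right tool, one of two parameters correct.
model_output = {"tool": "get_weather", "params": {"city": "Cairo", "unit": "C"}}
expected = {"tool": "get_weather", "params": {"city": "Cairo", "unit": "F"}}
print(score_tool_call(model_output, expected))  # → 0.5
```

With only a handful of tools, a small table of (input text, expected call) pairs and this scorer covers most of the eval.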

yelmahallawy · 15 hours ago
Makes sense, simpler=better. Thanks!
mock-possum · 3 days ago
We feed a handful of preset questions through the new AI, collect the results, and ask another AI to score the answers against example 'good' answers we've written. Then we have a guy sit down and use that output as a starting point to rank the performance of that AI against all the previous ones.

Seems like it works pretty well. Our prompts and params get tweaked towards better and better results, and we get a sense of what’s worth paying more for.
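The pipeline described above (preset questions → candidate answers → judge scores against reference answers → human review) could be skeletoned like this. `ask_candidate` and `judge_score` are stubs; in practice each would call an LLM API, and the questions, reference answers, and overlap metric here are invented for illustration:

```python
# Skeleton of an LLM-as-judge eval pipeline; all data is hypothetical.
PRESET_QUESTIONS = ["What does HTTP 404 mean?", "Define idempotency."]
REFERENCE_ANSWERS = {
    "What does HTTP 404 mean?": "The server cannot find the requested resource.",
    "Define idempotency.": "Repeating the operation gives the same result.",
}

def ask_candidate(question):
    # Stub: would call the new model under evaluation.
    return "The server cannot find the requested resource."

def judge_score(answer, reference):
    # Stub: would ask a judge model to compare answer vs. reference.
    # Here: crude token overlap (Jaccard) as a placeholder metric.
    a, r = set(answer.lower().split()), set(reference.lower().split())
    return len(a & r) / len(a | r)

def evaluate():
    rows = []
    for q in PRESET_QUESTIONS:
        ans = ask_candidate(q)
        rows.append((q, ans, judge_score(ans, REFERENCE_ANSWERS[q])))
    # A human then reviews `rows` to rank this model against earlier ones.
    return rows

for q, ans, score in evaluate():
    print(round(score, 2))
```

The human step stays in the loop: the judge's scores are only a starting point for the ranking, as described above.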

yelmahallawy · a day ago
The guy who reviews all of this: is his role at the company fully dedicated to reviewing these eval pipelines?
gabdiax · 2 days ago
I'm also curious about this.

In some cases I've seen teams rely on a mix of automated metrics and human review, especially for production systems where reliability matters a lot.

But evaluation pipelines for AI still seem much less standardized compared to traditional software monitoring.

yelmahallawy · a day ago
Yeah, it feels like an unsolved problem still. I've also seen many teams spend hours on human review in eval pipelines (and this accumulates with each new model that gets released).
mierz00 · 2 days ago
I highly rate Braintrust.

It wouldn’t be too difficult to build something like that for your own usage, but I found it pretty easy to get datasets set up.

Essentially a game changer in understanding if your prompts are working. Especially if you’re doing something which requires high levels of consistency.

In our case we would use LLM for classification which fits in perfectly with evals.

yelmahallawy · a day ago
Have any good takeaways / feedback on this? It's the first time I've heard of Braintrust (the eval platform), so I'll look into it, but I'm curious about your experience with it so far.
maxalbarello · 3 days ago
Also wondering how to eval agentic pipelines. For instance, I generated memories from my ChatGPT conversation history; how do I know whether they are accurate or not?

I would like a single number that I would use to optimize the pipeline with but I find it hard to figure out what that number should be measuring.

yelmahallawy · a day ago
I think this is a common problem, actually: figuring out what to measure and how to measure it isn't black and white. What I do is score against a few dimensions (this may or may not fit your use case): relevance, instruction following, clarity, hallucination rate, etc. But even then, it becomes hard to measure things like 'clarity'.
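Collapsing those dimensions into the single optimization number the parent asks for is usually just a weighted average. The dimension names come from the comment above, but the weights and per-dimension scores below are arbitrary placeholders; in practice each score would come from a judge model or a human rater:

```python
# Hypothetical weights; chosen for illustration, not recommendations.
DIMENSIONS = {
    "relevance": 0.3,
    "instruction_following": 0.3,
    "clarity": 0.2,
    "hallucination_free": 0.2,  # 1.0 = no hallucinations observed
}

def aggregate(scores, weights=DIMENSIONS):
    # Weighted average of per-dimension scores in [0, 1].
    assert set(scores) == set(weights)
    return sum(scores[d] * w for d, w in weights.items())

scores = {"relevance": 0.9, "instruction_following": 0.8,
          "clarity": 0.6, "hallucination_free": 1.0}
print(round(aggregate(scores), 2))  # → 0.83
```

The single number is convenient for optimizing the pipeline, but it hides which dimension regressed, so it's worth keeping the per-dimension scores around too.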
bisonbear · 3 days ago
Assume you're referencing coding agents; I don't think people are. If they are, it's likely using:

- AI to evaluate itself (e.g. ask Claude to test out its own skill)
- a custom-built platform (I see interest in this space)

I've actually been thinking about this problem a lot and am working on making a custom eval runner for your codebase. What would your usecase be for this?

yelmahallawy · a day ago
I'd love to hear more about what you're working on (if you're open to sharing!).

I like to play with knowledge-base-powered chatbots, but what's most useful to me (and probably my primary use case) is coding agents, since I use CC every day. Recently I heard about Minimax m2.5, which is apparently a pretty good coding agent (they say it's comparable to opus 4.6), but I haven't tried it yet; plus it'd take a lot of time to figure out whether it's actually better.

celestialcheese · 3 days ago
A mix of promptfoo and ad-hoc Python scripts, with Langfuse for observability.

Definitely not happy with it, but everything is moving too fast to feel like it's worth investing in.

yelmahallawy · a day ago
Any takeaways on Promptfoo?
moltar · 2 days ago
I use Promptfoo
yelmahallawy · a day ago
Any takeaways? Has it been helpful? OpenAI just acquired them, so it's probably useful, but I was curious to hear more from people who've actually used it.
satisfice · 3 days ago
It’s called testing. And from the reports and comments, there doesn’t seem to be much of it happening. The reason is: it’s quite expensive to do well.

I find that for every hypothesis I might have to run a thousand prompts to collect enough data for a conclusion. For instance, to discover how reliably different models can extract noun phrases from a text: hours of grinding. Even so, that was a small text; I haven't yet run the process on a large one.
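The shape of that experiment (repeat the same extraction many times, count how often the result matches a reference set) could be sketched like this. The reference phrases and the stubbed `extract_noun_phrases` are hypothetical; a real version would call each model and parse its output:

```python
import random

# Hypothetical reference noun phrases for one small text.
EXPECTED = {"the quick fox", "the lazy dog"}

def extract_noun_phrases(text, seed):
    # Stub for a model call: deterministically "misses" one phrase in
    # roughly 30% of runs, to mimic an unreliable model.
    random.seed(seed)
    return EXPECTED if random.random() < 0.7 else {"the quick fox"}

def reliability(text, runs=1000):
    # Fraction of runs where the extraction exactly matches the reference.
    ok = sum(extract_noun_phrases(text, i) == EXPECTED for i in range(runs))
    return ok / runs

print(reliability("The quick fox jumped over the lazy dog.") > 0.5)  # → True
```

A thousand runs per hypothesis is exactly why this gets expensive: the loop above is cheap with a stub, but each iteration is a paid, rate-limited API call against a real model.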

yelmahallawy · a day ago
Yeah it's a super tedious process and I was hoping that _maybe_ there is a tool out there that can help with this.
