Stop underestimating the amount of internalized knowledge people can have about projects in the real world; it's so annoying.
An LLM can't ever get close to it. There's some guy on a team in another building who knows why a certain weird piece of critical business logic was put there six years ago. The LLM will never know this, and wouldn't understand it even if it consumed the whole repository, because it would have to work there for years to understand how the business works.
In my experience, working with agents helps eliminate that crap, because you have to bring the agent along as it reads your code (or process or whatever) for it to be effective. Just like human co-workers need to be brought along, so it's not all on poor Bob.
I don't think this is true for many fields - especially outside of math/programming. Let's say the task is "find the ten most promising energy startups in Europe." (This is essentially the sort of work I frequently see people here or on LinkedIn talk about using the research modes of models for.)
In ye olden days pre-LLM, you'd be able to easily filter out a bunch of bad answers from lazy humans, since they'd be short, contain no detail, and have a bunch of typos and formatting inconsistencies from copy-paste. You can't do that for LLM output.
So unless you're a domain expert on European energy startups you can't check for a good answer without doing a LOT of homework. And if you're using a model that usually only looks at, say, the top two pages of Google results to try to figure this out, how is the validator going to do better than the original generator?
And what about when the top two pages of Google results start turning into model-generated blogspam?
If your benchmark can't evaluate prospective real-world tasks like this, it's of limited use.
A larger issue is that once your benchmark, which used this task as a criterion based on an expert's knowledge, is published, anyone making an AI agent is incredibly incentivized (intentionally or not!) to train specifically on this answer without necessarily getting better at the fundamental steps of the task.
IMO you can never use an AI agent benchmark that is published on the internet more than once.
That's true if you're just evaluating the final answer. However, wouldn't you evaluate the context -- including internal tokens -- built by the LLM under test?
In essence, the evaluator's job isn't to do separate fact-finding, but to evaluate whether the under-test LLM made good decisions given the facts at hand.
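For what it's worth, that kind of trace-level evaluation is easy to sketch; the hard part is writing a rubric the judge can apply consistently. Here's a minimal sketch, assuming you've logged the agent's intermediate context and have some judge model to call (AgentStep, judge_llm, and the rubric text are all made up for illustration, not any particular benchmark's API):

```python
# Sketch of trace-level ("process") evaluation: instead of fact-checking the
# agent's final answer, score each decision it made given the context it had
# at that step. Names like AgentStep and judge_llm() are hypothetical.
from dataclasses import dataclass

@dataclass
class AgentStep:
    context: str    # what the agent had read/retrieved before this step
    action: str     # e.g. the query it issued or the source it chose to trust
    rationale: str  # the agent's own stated reason (internal tokens, if logged)

def judge_llm(prompt: str) -> str:
    """Placeholder for whatever judge model you use; expected to return '1'-'5'."""
    raise NotImplementedError

RUBRIC = (
    "Given only the context the agent had at this step, rate 1-5 whether its "
    "action and rationale were reasonable. Do not reward facts the agent could "
    "not have known from the context shown."
)

def score_trace(steps: list[AgentStep]) -> float:
    # Average the judge's per-step ratings into a single process score.
    scores = []
    for step in steps:
        prompt = (
            f"{RUBRIC}\n\nCONTEXT:\n{step.context}\n\n"
            f"ACTION:\n{step.action}\n\nRATIONALE:\n{step.rationale}\n\nScore:"
        )
        scores.append(int(judge_llm(prompt).strip()))
    return sum(scores) / len(scores) if scores else 0.0
```

The catch is that the judge still can't tell you whether a given startup actually belonged in the top ten without the same domain knowledge the grandparent is worried about; it can only tell you whether each decision looked sensible given the context the agent had retrieved.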