In fact we know how to live forever, control our telomeres. We know it works because cancer exists. We just can’t control it but controlled cancer is effectively immortality.
This is actually very wrong. Consider for instance the fact that people who grade your tests in school are typically more talented, capable, trained than the people taking the test. This is true even when an answer key exists.
> Also, human labels are good but have problems of their own,
Granted, but...
> it isn’t like by using a “different intelligence architecture” we elide all the possible errors
nobody is claiming this. We elide the specific, obvious problem that using a system to test itself gives you no reliable information. You need a control.
I don’t think we should assume answering a test would be easy for a Scantron machine just because it is very good at grading them, either.
I don't think this is true for many fields - especially outside of math/programming. Let's say the task is "find the ten most promising energy startups in Europe." (This is essentially the sort of work I see people frequently talk about using research modes of models for here or on LinkedIn.)
In ye olden days pre-LLM you'd be able to easily filter out a bunch of bad answers from lazy humans since they'd be short, contain no detail, have a bunch of typos, formatting inconsistencies from copy-paste, etc. You can't do that for LLM output.
So unless you're a domain expert on European energy startups you can't check for a good answer without doing a LOT of homework. And if you're using a model that usually only looks at, say, the top two pages of Google results to try to figure this out, how is the validator going to do better than the original generator?
And what about when the top two pages of Google results start turning into model-generated blogspam?
If your benchmark can't evaluate prospective real-world tasks like this, it's of limited use.
A larger issue is that once your benchmark, that used this task as a criteria, based on an expert's knowledge, is published, anyone making an AI Agent is incredibly incentivized to (intentionally or not!) to train specifically on this answer without necessarily actually getting better at the fundamental steps in the task.
IMO you can never use an AI agent benchmark that is published on the internet more than once.
If they can’t write an evaluation for the discriminator I agree. All the input data issues you highlight also apply to generators.
Lots of other good replies to this specific part, but also, lots of developers are struggling with the feeling that reviewing code is harder than writing code (something I personally not sure I agree with), seen that sentiment being shared here on HN a lot, and would directly go against that particular idea.
Fundamentally I’m not disagreeing with the article, but also think most people who care take the above approach because if you do care you read samples, find the issues, and patch them to hill climb better
Deleted Comment
However the article does seem to be pointing out some fundamental issues. I'm particularly annoyed by using LLMs to evaluate the output of LLMs. Anyone with enough experience to be writing benchmarks of this sort in the first place ought to know that's a no-go. It isn't even just using "AI to evaluate AI" per se, but using a judge of the same architecture as the thing being judged maximizes the probability of fundamental failure of the benchmark to be valid due to the judge having the exact same blind spots as the thing under test. As we, at the moment, lack a diversity of AI architectures that can play on the same level as LLMs, it is simply necessary for the only other known intelligence architecture, human brains, to be in the loop for now, however many other difficulties that may introduce into the testing procedures.
Tests that a "do nothing" AI can pass aren't intrinsically invalid but they should certainly be only a very small number of the tests. I'd go with low-single-digit percentage, not 38%. But I would say it should be above zero; we do want to test for the AI being excessively biased in the direction of "doing something", which is a valid failure state.
Discriminating good answers is easier than generating them. Good evaluations write test sets for the discriminators to show when this is or isn’t true. Evaluating the outputs as the user might see them are more representative than having your generator do multiple tasks (e.g. solve a math query and format the output as a multiple choice answer).
Also, human labels are good but have problems of their own, it isn’t like by using a “different intelligence architecture” we elide all the possible errors. Good instructions to the evaluation model often translate directly to better human results, showing a correlation between these two sources of sampling intelligence.
And what of its negotiating credibility? How can the other side trust that an agreement will hold in the future?
This is not a critique, but a genuine curiosity, because there's an obvious drawback with a system with opposing world views.
Unless, of course, something still unites them in the first place, with acceptable disparity on each side turning it into an advantage of flexibility and adaptability while keeping the focus on long-term ideas and plans.