alextheparrot (u/alextheparrot)

alextheparrot commented on AI agent benchmarks are broken ddkang.substack.com/p/ai-... · Posted by u/neehao

tempfile · 2 months ago

> Discriminating good answers is easier than generating them.

This is actually very wrong. Consider for instance the fact that people who grade your tests in school are typically more talented, capable, trained than the people taking the test. This is true even when an answer key exists.

> Also, human labels are good but have problems of their own,

Granted, but...

> it isn’t like by using a “different intelligence architecture” we elide all the possible errors

Nobody is claiming this. We elide the specific, obvious problem that using a system to test itself gives you no reliable information. You need a control.

alextheparrot · 2 months ago

It isn’t actually very wrong. Your example is tangential as graders in school have multiple roles — teaching the content and grading. That’s an implementation detail, not a counter to the premise.

I don’t think we should assume answering a test would be easy for a Scantron machine just because it is very good at grading them, either.

alextheparrot commented on AI agent benchmarks are broken ddkang.substack.com/p/ai-... · Posted by u/neehao

majormajor · 2 months ago

> Discriminating good answers is easier than generating them.

I don't think this is true for many fields - especially outside of math/programming. Let's say the task is "find the ten most promising energy startups in Europe." (This is essentially the sort of work I see people frequently talk about using research modes of models for here or on LinkedIn.)

In ye olden days pre-LLM you'd be able to easily filter out a bunch of bad answers from lazy humans since they'd be short, contain no detail, have a bunch of typos, formatting inconsistencies from copy-paste, etc. You can't do that for LLM output.

So unless you're a domain expert on European energy startups you can't check for a good answer without doing a LOT of homework. And if you're using a model that usually only looks at, say, the top two pages of Google results to try to figure this out, how is the validator going to do better than the original generator?

And what about when the top two pages of Google results start turning into model-generated blogspam?

If your benchmark can't evaluate prospective real-world tasks like this, it's of limited use.

A larger issue is that once your benchmark, which used this task as a criterion based on an expert's knowledge, is published, anyone making an AI agent is incredibly incentivized (intentionally or not!) to train specifically on this answer without necessarily actually getting better at the fundamental steps in the task.

IMO you can never use an AI agent benchmark that is published on the internet more than once.

alextheparrot · 2 months ago

> Good evaluations write test sets for the discriminators to show when this is or isn’t true.

If they can’t write an evaluation for the discriminator I agree. All the input data issues you highlight also apply to generators.

alextheparrot commented on AI agent benchmarks are broken ddkang.substack.com/p/ai-... · Posted by u/neehao

diggan · 2 months ago

> Discriminating good answers is easier than generating them.

Lots of other good replies to this specific part, but also, lots of developers are struggling with the feeling that reviewing code is harder than writing code (something I'm personally not sure I agree with). I've seen that sentiment shared here on HN a lot, and it would directly go against that particular idea.

alextheparrot · 2 months ago

I wish the other replies and this would engage with the sentence right after it indicating that you should test this premise empirically.

alextheparrot commented on AI agent benchmarks are broken ddkang.substack.com/p/ai-... · Posted by u/neehao

suddenlybananas · 2 months ago

What's 45+8? Is it 63?

alextheparrot · 2 months ago

If this sort of error isn’t acceptable, it should be part of an evaluation set for your discriminator.

Fundamentally I’m not disagreeing with the article, but I also think most people who care take the above approach, because if you do care you read samples, find the issues, and patch them to hill climb better.
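A minimal sketch of what "part of an evaluation set for your discriminator" could look like. Everything here is hypothetical and made up for illustration: the cases in `EVAL_SET`, the `lenient_judge` stand-in (a real setup would call an LLM judge instead), and the `judge_failures` helper.

```python
from typing import Callable, List, Tuple

# (question, proposed_answer, is_actually_correct): hypothetical hand-labeled cases
Case = Tuple[str, str, bool]

EVAL_SET: List[Case] = [
    ("What's 45+8?", "63", False),  # the error from the parent comment
    ("What's 45+8?", "53", True),
    ("What's 12*3?", "36", True),
    ("What's 100-1?", "98", False),
]


def lenient_judge(question: str, answer: str) -> bool:
    """Stand-in for a sloppy judge: accepts any answer that looks like a number."""
    return answer.strip().isdigit()


def judge_failures(judge: Callable[[str, str], bool], cases: List[Case]) -> List[Case]:
    """Return the cases where the judge's verdict disagrees with the label."""
    return [(q, a, label) for q, a, label in cases if judge(q, a) != label]


failures = judge_failures(lenient_judge, EVAL_SET)
accuracy = 1 - len(failures) / len(EVAL_SET)

for question, answer, label in failures:
    print(f"judge got it wrong: {question!r} -> {answer!r} (label: {label})")
print(f"judge accuracy: {accuracy:.2f}")
```

The failures list is the part you read and patch against; the accuracy number only tells you whether the patching is working.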

alextheparrot commented on AI agent benchmarks are broken ddkang.substack.com/p/ai-... · Posted by u/neehao

jerf · 2 months ago

When I was being a bad HN reader and just reacting to the title, my initial impulse was to be placating, and observe that they are probably just immature. After all, for all that has happened, this is still only a couple of years' worth of development, and it does tend to take a long time to develop good benchmarks.

However the article does seem to be pointing out some fundamental issues. I'm particularly annoyed by using LLMs to evaluate the output of LLMs. Anyone with enough experience to be writing benchmarks of this sort in the first place ought to know that's a no-go. It isn't even just using "AI to evaluate AI" per se, but using a judge of the same architecture as the thing being judged maximizes the probability of fundamental failure of the benchmark to be valid due to the judge having the exact same blind spots as the thing under test. As we, at the moment, lack a diversity of AI architectures that can play on the same level as LLMs, it is simply necessary for the only other known intelligence architecture, human brains, to be in the loop for now, however many other difficulties that may introduce into the testing procedures.

Tests that a "do nothing" AI can pass aren't intrinsically invalid but they should certainly be only a very small number of the tests. I'd go with low-single-digit percentage, not 38%. But I would say it should be above zero; we do want to test for the AI being excessively biased in the direction of "doing something", which is a valid failure state.

alextheparrot · 2 months ago

LLMs evaluating LLM outputs really isn’t that dire…

Discriminating good answers is easier than generating them. Good evaluations write test sets for the discriminators to show when this is or isn’t true. Evaluating the outputs as the user might see them is more representative than having your generator do multiple tasks (e.g. solve a math query and format the output as a multiple choice answer).

Also, human labels are good but have problems of their own, it isn’t like by using a “different intelligence architecture” we elide all the possible errors. Good instructions to the evaluation model often translate directly to better human results, showing a correlation between these two sources of sampling intelligence.
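One rough way to check how closely an evaluation model tracks human labels is a chance-corrected agreement score. The sketch below uses Cohen's kappa, which is my choice of metric for illustration and not something the thread prescribes; the `human` and `judge` label lists are invented data.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(a)
    # Fraction of items where the two raters gave the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(
        (counts_a[k] / n) * (counts_b[k] / n) for k in set(a) | set(b)
    )
    if expected == 1.0:
        # Both raters used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels: True means "answer judged acceptable".
human = [True, True, False, False, True, False, True, False]
judge = [True, True, False, True, True, False, True, False]

kappa = cohens_kappa(human, judge)
print(f"kappa = {kappa:.2f}")
```

Running the same score before and after an instruction change (and between two human raters) gives a cheap read on whether the judge is getting closer to people or just getting more confident.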

alextheparrot commented on How we’re responding to The NYT’s data demands in order to protect user privacy openai.com/index/response... · Posted by u/BUFU

energy123 · 3 months ago

To opt-out of your data being trained on, you need to go to https://privacy.openai.com and click the button "Make a Privacy Request".

alextheparrot · 3 months ago

in the app: Settings ~> Data Controls ~> Improve the model for everyone

alextheparrot commented on Retailers will soon have only about 7 weeks of full inventories left fortune.com/article/retai... · Posted by u/andrewfromx

sam_goody · 4 months ago

I know nothing of economics, and am not trying to defend Trump's moves.

But, it is possible that his policy of "do everything at once, without taking the time to do it right" is more reflective of his belief that whatever he tries [even just being president] will be fought, so his options [from his POV] are "do it now" or "don't do it at all", not "do it right".

EDIT: Am willing to learn, would the downvoters explain - do you disagree that this is his view? Or does his understanding not matter when he acts upon it?

alextheparrot · 4 months ago

That’s a premise that would make me consider the wisdom of my actions.

alextheparrot commented on Lawmakers are skeptical of Zuckerberg's commitment to free speech theverge.com/news/646288/... · Posted by u/speckx

cloogshicer · 5 months ago

I don't understand this position. Why do people want private companies to decide what's allowed and what isn't? Shouldn't lawmakers, and by extension the people (at least in democratic countries) decide what speech is allowed and what isn't?

The number of people using social media makes it the town square of the present. We should treat it as such.

alextheparrot · 5 months ago

Quippy, but off the cuff:

- I don’t go to my present town square(s) socially because it is full of a-social behavior. Same reason to avoid certain bars or clubs, prefer certain parks, or why some are wary of public transit.

- I don’t feel a right to decide the vibe of how a business curates its space. My bakery, coffee shop, local library, etc. all curate a space with an opinion. I don’t feel I have standing to assert that my preferences should dominate their choices.

As an aside, businesses are also an extension of the people; the best ones just tend not to be mode collapsed.

alextheparrot commented on The hacking of culture and the creation of socio-technical debt schneier.com/blog/archive... · Posted by u/BorgHunter

alextheparrot · a year ago

Really enjoyed the piece.

A passing thought: the ethe of individuals in the 70s and 80s is important because of the people it informed in subsequent years. While many people still like to hack, code, etc., the relative proportion of people doing this and working in tech continues to diminish as the popularity and importance of the sector grows. I wonder if debt without values / a more cohered zeitgeist is better or worse?