Readit News
upperhalfplane commented on Problems in LLM Benchmarking and Evaluation   xent.tech/blog/problems-i... · Posted by u/acegod
tianlong · 9 days ago
What's the TLDR about how to solve that benchmark problem?
upperhalfplane · 9 days ago
TLDR: games are a good way to go.
upperhalfplane commented on Problems in LLM Benchmarking and Evaluation   xent.tech/blog/problems-i... · Posted by u/acegod
sbilstein · 9 days ago
Personally, I think LLM benchmarks make agents worse. All these companies chase the benchmarks, overfit, and think being able to cheat at the math olympiad is gonna get us to AGI. Instead, researchers should peer in and get me an agent that can reliably count the number of "i"s in "mississippi".
upperhalfplane · 9 days ago
I don't quite think they cheat at math olympiads, but there are obviously blind spots on unspectacular tasks. That said, Mississippi is both a good and a bad question to ask. On one hand, it's "the bare minimum" to require; on the other hand, is it really a feat? Most models can write a piece of code that would compute it. If you give me a task I'm not designed to solve (like counting the number of i's in a text), the smart thing is actually to write a program to count them (which LLMs can do).

The best way to measure intelligence is probably to check whether a model knows its own strengths and weaknesses and deals with them efficiently. That ability is the most important thing for an eval to capture.
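The tool-use approach described above (writing a program rather than counting token by token) can be sketched in a few lines; the helper name here is illustrative, not from any particular agent framework:

```python
def count_letter(text: str, letter: str) -> int:
    """Count occurrences of a single letter, case-insensitively."""
    return text.lower().count(letter.lower())

print(count_letter("mississippi", "i"))  # prints 4
```

An agent that recognizes counting as outside its strengths and emits code like this sidesteps the tokenization blind spot entirely.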

upperhalfplane commented on Training language models to be warm and empathetic makes them less reliable   arxiv.org/abs/2507.21919... · Posted by u/Cynddl
meowface · 12 days ago
I am skeptical that any model can actually determine what sort of prompts will have what effects on itself. It's basically always guessing / confabulating / hallucinating if you ask it an introspective question like that.

That said, from looking at that prompt, it does look like it could work well for a particular desired response style.

upperhalfplane · 11 days ago
> It's basically always guessing / confabulating / hallucinating if you ask it an introspective question like that.

You're absolutely right! This is the premise of a recent paper: https://www.arxiv.org/abs/2506.06832

upperhalfplane commented on OpenAI claims gold-medal performance at IMO 2025   twitter.com/alexwei_/stat... · Posted by u/Davidzheng
crinkly · a month ago
100% agree with this.

My second degree is in mathematics. Not only can I probably not do these but they likely aren’t useful to my work so I don’t actually care.

I’m not sure an LLM could replace the mathematical side of my work (modelling). Mostly because it’s applied and people don’t know what they are asking for, what is possible or how to do it and all the problems turn out to be quite simple really.

upperhalfplane · a month ago
100% agree with this too (also a professional mathematician). To mathematicians who have not trained on such problems, these will typically look very hard, especially the more recent olympiad problems (as opposed to problems from, e.g., 30 years ago). These problems have become more about mastering a very impressive list of techniques than they were at the inception (and participants prepare for them more and more). Research mathematics, meanwhile, has also become more and more technical, but the techniques are very different, so the correlation between olympiad performance and research is probably smaller than it once was.
upperhalfplane commented on XentGame: Help Minimize LLM Surprise   xentlabs.ai... · Posted by u/upperhalfplane
frotaur · 6 months ago
Pretty cool, I managed to get a reasonable score. I wonder if the high scores are close to the theoretical maximum, or if we are an order of magnitude below.
upperhalfplane · 6 months ago
I don't think it's an order of magnitude below... though it's a bit hard to know for sure.
upperhalfplane commented on XentGame: Help Minimize LLM Surprise   xentlabs.ai... · Posted by u/upperhalfplane
tianlong · 6 months ago
Interesting! I'm wondering if it's possible to cheat using LLMs.
upperhalfplane · 6 months ago
It looks like you can (there are some LLM responses out there, e.g. Sonnet 3.5). Not clear if they can be super good at this, though.
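The game's name suggests scoring by cross-entropy, i.e. how "surprised" the model is by the text. A toy illustration of that metric, with made-up per-token probabilities (the function and numbers are hypothetical, not XentGame's actual scoring):

```python
import math

def total_surprisal(token_probs):
    """Sum of -log2(p) over tokens: a lower total means the model
    found the text less surprising."""
    return sum(-math.log2(p) for p in token_probs)

# Hypothetical probabilities a model might assign to each token.
likely = [0.5, 0.25, 0.5]      # predictable continuation
unlikely = [0.01, 0.02, 0.05]  # surprising continuation

print(total_surprisal(likely))    # 4.0 bits
print(total_surprisal(unlikely))  # far more bits
```

Under this framing, "cheating" with an LLM just means searching for text another model assigns high probability to, which is why it is unclear how much of an edge that gives.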

u/upperhalfplane

Karma: 13 · Cake day: February 2, 2025