fallmonkey (u/fallmonkey)

fallmonkey commented on Tau² benchmark: How a prompt rewrite boosted GPT-5-mini by 22% quesma.com/blog/tau2-benc... · Posted by u/blndrt

tedsanders · 6 months ago

>GPT-5 showed significant improvement only in one benchmark domain - which is Telecom. The other ones have been somehow overlooked during model presentation - therefore we won’t bother about them either.

I work at OpenAI and you can partly blame me for our emphasis on Telecom. While we no doubt highlight the evals that make us look good, let me defend why the emphasis on Telecom isn't unprincipled cherry picking.

Telecom was made after Retail and Airline, and fixes some of their problems. In Retail and Airline, the model is graded against a ground truth reference solution. Grading against a reference solution makes grading easier, but has the downside that valid alternative solutions can receive scores of 0 by the automatic grading. This, along with some user model issues, is partly why Airline and Retail scores stopped climbing with the latest generations of models and are stuck around 60% / 80%. I'd bet you $100 that a superintelligence would probably plateau around here too, as getting 100% requires perfect guessing of which valid solution is written as the reference solution.

In Telecom, the authors (Barres et al.) made the grading less brittle by grading against outcome states, which may be achieved via multiple solutions, rather than by matching against a single specific solution. They also improved the user modeling and some other things too. So Telecom is the much better eval, with a much cleaner signal, which is partly why models can score as high as 97% instead of getting mired at 60%/80% due to brittle grading and other issues.
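
To make the difference concrete, here is a toy sketch in Python. It is not the tau2-bench grader: the state keys, the action format, and the function names are all made up for illustration. It only shows why matching a single reference trajectory can give a 0 to a valid solution that outcome-state grading would accept.

```python
# Toy illustration only: hypothetical state keys and action format,
# not the actual tau2-bench grading code.

def run(actions, initial_state):
    """Apply (key, value) actions in order and return the final state."""
    state = dict(initial_state)
    for key, value in actions:
        state[key] = value
    return state

def grade_by_reference(actions, reference_actions):
    """Brittle grading: pass only if the action sequence matches exactly."""
    return 1 if list(actions) == list(reference_actions) else 0

def grade_by_outcome(actions, initial_state, goal_state):
    """Outcome grading: pass if the final state equals the goal state."""
    return 1 if run(actions, initial_state) == goal_state else 0

initial = {"airplane_mode": True, "data_enabled": False}
reference = [("airplane_mode", False), ("data_enabled", True)]
# Same fix, applied in a different order: still a valid solution.
alternative = [("data_enabled", True), ("airplane_mode", False)]
goal = run(reference, initial)

print(grade_by_reference(alternative, reference))      # exact-match grader
print(grade_by_outcome(alternative, initial, goal))    # outcome grader
```

The alternative ordering fails the exact-match grader but passes the outcome grader, which is the kind of false negative described above.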

Even if I had never seen GPT-5's numbers, I like to think I would have said ahead of time that Telecom is much better than Airline/Retail for measuring tool use.

Incidentally, another thing to keep in mind when critically looking at OpenAI and others reporting their scores on these evals is that the evals give no partial credit - so sometimes you can have very good models that do all but one thing perfectly, which results in very poor scores. If you tried generalizing to tasks that don't trigger that quirk, you might get much better performance than the eval scores suggest (or vice versa, if your tasks trigger a quirk not present in the eval).
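
A back-of-the-envelope sketch of the no-partial-credit effect, in Python. It assumes each required step in a task succeeds independently with the same probability, which is a simplification I'm adding for illustration and not something the eval guarantees; the function name is also made up.

```python
# Simplifying assumption (not from the benchmark): every step succeeds
# independently with the same probability.

def all_or_nothing_pass_rate(per_step_accuracy, steps):
    """Chance of getting every one of `steps` steps right in a single task."""
    return per_step_accuracy ** steps

# A model that gets each step right 95% of the time, on 10-step tasks:
print(round(all_or_nothing_pass_rate(0.95, 10), 3))  # 0.599
```

Under that assumption, a model that is right 95% of the time per step passes only about 60% of 10-step tasks, so the headline score can understate per-step quality quite a bit.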

Here's the tau2-bench paper if anyone wants to read more: https://arxiv.org/abs/2506.07982

fallmonkey · 6 months ago

Appreciate the response! I noticed the same when I ran tau2 myself on gpt-5 and 4.1: gpt-5 is really good at reading tool results and interleaving them with thinking, while 4.1/o3 struggle to decide on the right next tool even with thinking. If anything, gpt-5 is almost too good at figuring out the right tool to use in one go. Amazing progress.

fallmonkey commented on Are OpenAI and Anthropic losing money on inference? martinalderson.com/posts/... · Posted by u/martinald

atq2119 · 6 months ago

When using APIs, you pay for reasoning tokens like you do for actual outputs. So, the estimation on a per-token basis is not affected by reasoning.

What reasoning affects is the ratio of input to output tokens, and since input tokens are cheaper, that may well affect the economics in the end.
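
A rough sketch of that arithmetic in Python. The token counts and per-million prices are placeholders I picked for illustration, not any provider's actual rates, and the function names are hypothetical. The only assumption carried over from above is that reasoning tokens are billed at the output price.

```python
# Placeholder numbers throughout; only the billing rule (reasoning tokens
# are charged as output tokens) comes from the discussion above.

def request_cost(input_tokens, visible_output_tokens, reasoning_tokens,
                 input_price_per_m, output_price_per_m):
    """Cost of one request, with reasoning tokens billed as output."""
    billed_output = visible_output_tokens + reasoning_tokens
    total = input_tokens * input_price_per_m + billed_output * output_price_per_m
    return total / 1_000_000

def input_output_ratio(input_tokens, visible_output_tokens, reasoning_tokens):
    """Ratio of input tokens to billed output tokens."""
    return input_tokens / (visible_output_tokens + reasoning_tokens)

# Same prompt and same visible answer, without and with reasoning.
print(input_output_ratio(10_000, 500, 0), request_cost(10_000, 500, 0, 1.0, 10.0))
print(input_output_ratio(10_000, 500, 4_500), request_cost(10_000, 500, 4_500, 1.0, 10.0))
```

The per-token prices don't change, but the input-to-output ratio drops from 20:1 to 2:1 in this example, and the request cost rises with it.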

fallmonkey · 7 months ago

Correct, and with reasoning, the ratio is way off. As others have pointed out, actual usage is far higher (much more than 3-5x) than the article's estimate, which probably only fits very light users.

fallmonkey commented on Are OpenAI and Anthropic losing money on inference? martinalderson.com/posts/... · Posted by u/martinald

fallmonkey · 7 months ago

The estimate for output tokens is too low, since a single reasoning-enabled response can burn through thousands of output tokens. The estimate for input tokens is also low, since in actual use a lot of context (memory, agents.md, rules, etc.) gets included nowadays.

fallmonkey commented on LIMO: Less Is More for Reasoning arxiv.org/abs/2502.03387... · Posted by u/trott

fallmonkey · a year ago

While there are interesting findings here, https://arxiv.org/pdf/2502.03373 (also with a lot of good findings) suggests a somewhat contradictory view on how much training and data it takes to reach critical mass for reasoning capability.

fallmonkey commented on Deepseek: The quiet giant leading China’s AI race chinatalk.media/p/deepsee... · Posted by u/sunny-beast

fallmonkey · a year ago

Strangely, DeepSeek has been a prominent name in the open source LLM community since last year, with their repos and papers - https://github.com/deepseek-ai. Nothing about them is really quiet, except that they probably burn 1% of the marketing money that other Chinese LLM players do.

fallmonkey commented on If needed, you have a role at Microsoft that matches your compensation twitter.com/kevin_scott/s... · Posted by u/intellectronica

darknavi · 2 years ago

> Know that if needed, you have a role at Microsoft that matches your compensation

Great if true.

fallmonkey · 2 years ago

There’s an interesting dynamic here. What valuation of OpenAI should be used for the conversion? The 86b round is pretty much dead if they move to MS, yet 29b is too low (even 86b is low in terms of future potential). And how much upside will there be?

I have no doubt that MS can spend billions in cash or RSUs to compensate all of them, but I do believe there’s some gotcha if an exodus actually happens, and MS might not be so generous as to throw millions of dollars in cash at a general backend engineer who recently joined OpenAI.

fallmonkey commented on Ask HN: Books to read when you transform from SWE into SWE Management? · Posted by u/DDerTyp

fallmonkey · 4 years ago

I'd recommend Julie Zhuo's The Making of a Manager. She's on Twitter at https://twitter.com/joulee, where you can check out the managerial wisdom she shares and judge for yourself.

The book focuses on the practical side, like giving useful feedback, smart recruiting strategy, and meeting optimization, all toward the goal of better outcomes and amplifying team success, rather than just the extra activity that the conventional manager model entails.