aluminum96 (u/aluminum96)

aluminum96 commented on OpenAI claims gold-medal performance at IMO 2025 twitter.com/alexwei_/stat... · Posted by u/Davidzheng

bwfan123 · 8 months ago

Did humans formalize the inputs, or was the exact natural-language input provided to the LLM? A lot of detail is missing on the methodology used, not to mention any independent validation.

My skepticism stems from the past FrontierMath announcement, which turned out to be a bluff.

aluminum96 · 8 months ago

People are reading a lot into the FrontierMath articles from a couple months ago, but tbh I don’t really understand what the controversy is supposed to be there. Failing to clearly disclose sponsoring Epoch to make the benchmark doesn’t affect a model’s performance on it.

aluminum96 commented on OpenAI claims gold-medal performance at IMO 2025 twitter.com/alexwei_/stat... · Posted by u/Davidzheng

skdixhxbsb · 8 months ago

> We can only go off their word

We’re talking about Sam Altman’s company here. The same company that started out as a nonprofit claiming they wanted to better the world.

Suggesting they should be given the benefit of the doubt is dishonest at this point.

aluminum96 · 8 months ago

“they must be lying because I personally dislike them”

This is why HN threads about AI have become exhausting to read

aluminum96 commented on OpenAI claims gold-medal performance at IMO 2025 twitter.com/alexwei_/stat... · Posted by u/Davidzheng

AIPedant · 8 months ago

It almost certainly is specialized to IMO problems. Look at the way it is answering the questions: https://xcancel.com/alexwei_/status/1946477742855532918

E.g., here: https://pbs.twimg.com/media/GwLtrPeWIAUMDYI.png?name=orig

Frankly, it looks to me like it's using an AlphaProof-style system, going between natural language and Lean/etc. Of course OpenAI will not tell us any of this.

aluminum96 · 8 months ago

OpenAI explicitly stated that it is natural language only, with no tools such as Lean.

https://x.com/alexwei_/status/1946477745627934979?s=46&t=Hov...

aluminum96 commented on OpenAI claims gold-medal performance at IMO 2025 twitter.com/alexwei_/stat... · Posted by u/Davidzheng

blibble · 8 months ago

OpenAI has been caught doing exactly this before.

aluminum96 · 8 months ago

Why do people keep making up controversial claims like this? There is no evidence at all to this effect.

aluminum96 commented on OpenAI claims gold-medal performance at IMO 2025 twitter.com/alexwei_/stat... · Posted by u/Davidzheng

johnecheck · 8 months ago

The key bit here is whether the LLM doing the cherry picking had knowledge of the solution. If it didn't, this is a meaningful result. That's why I'd like more info, but I fear OpenAI is going to try to keep things under wraps.

aluminum96 · 8 months ago

Mark Chen posted that the system was locked before the contest. [1] It would obviously be crazy cheating to give verifiers a solution to the problem!

[1] https://x.com/markchen90/status/1946573740986257614?s=46&t=H...

aluminum96 commented on OpenAI claims gold-medal performance at IMO 2025 twitter.com/alexwei_/stat... · Posted by u/Davidzheng

emp17344 · 8 months ago

Why don’t they release some info beyond a vague Twitter hype post? I’m beginning to hate OpenAI for releasing statements like this that invariably end up being less impressive than they initially make them sound.

aluminum96 · 8 months ago

The proofs were published on GitHub for inspection, along with some details (generated within the time limit, by a system locked before the problems were released, with no external tools).

https://github.com/aw31/openai-imo-2025-proofs/tree/main

aluminum96 commented on OpenAI claims gold-medal performance at IMO 2025 twitter.com/alexwei_/stat... · Posted by u/Davidzheng

bwfan123 · 8 months ago

Based on the past history with FrontierMath and AIME 2025 [1], [2], I would not trust announcements that can't be independently verified. I am excited to try it out, though.

Also, the performance of LLMs on IMO 2025 was not even bronze [3].

Finally, this article shows that LLMs were mostly just bluffing on USAMO 2025 [4].

[1] https://www.reddit.com/r/slatestarcodex/comments/1i53ih7/fro...

[2] https://x.com/DimitrisPapail/status/1888325914603516214

[3] https://matharena.ai/imo/

[4] https://arxiv.org/pdf/2503.21934

aluminum96 · 8 months ago

The solutions were publicly posted to GitHub: https://github.com/aw31/openai-imo-2025-proofs/tree/main

aluminum96 commented on GPT-4.5 or GPT-5 being tested on LMSYS? rentry.co/GPT2... · Posted by u/atemerev

93po · 2 years ago

What's the right answer? My assumption is "not enough information".

aluminum96 · 2 years ago

What, you mean your fruit preferences don't form a total order?

aluminum96 commented on Google lays off its Python team social.coop/@Yhg1s/112332... · Posted by u/compiler-guy

dmoy · 2 years ago

Same thing with the Kythe (aka Grok) team that does the cross references for codesearch.

aluminum96 · 2 years ago

Just vaporized a whole team so the roles can be moved overseas :(

aluminum96 commented on SFMTA's train system running on floppy disks; city fears 'catastrophic failure' abc7news.com/san-francisc... · Posted by u/sidlls

aluminum96 · 2 years ago

SF voters rejected Proposition A in 2022 [1], which would have included funding to upgrade Muni's control systems (among many other projects). We'll eventually have to find the money somewhere else when the system fails.

[1] https://www.sfchronicle.com/sf/article/S-F-voters-narrowly-r...

u/aluminum96

Karma 986 · Cake day October 30, 2018