This is part of the ethical morass around the benchmark, and part of why some more serious researchers aren't touching it. People are not going to take it seriously if it continues like this!
In the case of TTT, I wouldn’t really describe that as a ‘new AGI reasoning approach’. People have been fine tuning deep learning models on specific tasks for a long time.
The fundamental instinct driving the creation of ARC -- that ‘deep learning cannot do system 2 thinking’ -- is under threat of being proven wrong very soon. Attempts to define the approaches that are working as somehow not ‘traditional deep learning’ really seem like moving the goalposts.
The new and surprising thing about test-time training (TTT) is how effective it is as an approach to dealing with novel abstract reasoning problems like ARC-AGI.
TTT was pioneered by Jack Cole last year and popularized this year by several teams, including this winning paper: https://ekinakyurek.github.io/papers/ttt.pdf
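For readers who haven't seen it, here's roughly what TTT looks like for a single ARC task: fine-tune a copy of the model on that task's few demonstration pairs at inference time, then predict the test output with the task-specialized weights. This is a minimal sketch assuming a Hugging Face-style causal LM; the grid serialization, hyperparameters, and lack of data augmentation are my simplifications, not the winning papers' exact recipe.

```python
import copy
import torch

def serialize(grid):
    # Grids are small 2D lists of ints; render each row as a string of digits.
    return "\n".join("".join(str(cell) for cell in row) for row in grid)

def ttt_solve(base_model, tokenizer, task, steps=32, lr=1e-5):
    # 1) Copy the base model so each task gets its own temporary weights.
    model = copy.deepcopy(base_model).train()
    opt = torch.optim.AdamW(model.parameters(), lr=lr)

    # 2) Fine-tune on the task's demonstration pairs at inference time.
    demos = [
        f"INPUT:\n{serialize(p['input'])}\nOUTPUT:\n{serialize(p['output'])}"
        for p in task["train"]
    ]
    for _ in range(steps):
        for text in demos:
            batch = tokenizer(text, return_tensors="pt")
            loss = model(**batch, labels=batch["input_ids"]).loss
            loss.backward()
            opt.step()
            opt.zero_grad()

    # 3) Predict the held-out test output with the task-specialized model.
    model.eval()
    prompt = f"INPUT:\n{serialize(task['test'][0]['input'])}\nOUTPUT:\n"
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=256)
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:])
```

The real systems also augment the demonstrations (rotations, reflections, color permutations) before fine-tuning, which matters a lot in practice.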
It is a great unit test for reasoning -- that's fantastic! And maybe it is indeed the best way to test for this -- who knows exactly. But the claim is a little grandiose for what it is; it's somewhat similar to saying that testing on string parity is the One True Test of an optimizer's efficiency.
I'd heartily recommend taking the marketing vibrance down a notch and keeping things a bit more measured. It's not entirely a meme, but some of the more serious researchers don't take it as seriously as a result. And that's the kind of people you want to attract to this sort of thing!
I think there is a potentially good future for ARC! But it might struggle to attract some of the kind of talent that you want to work on this problem as a result.
This is a fair critique. ARC Prize's 2024 messaging was sharp in order to break through the noise floor -- ARC has been around since 2019 but most people only learned about it this summer. Now that it has garnered awareness, that sharpness is no longer useful, and in some cases is hurting progress, as you point out. The messaging needs to evolve and mature next year to be more neutral/academic.
One big update since June is that progress is no longer stalled. Coming into 2024, the public consensus vibe was that pure deep learning / LLMs would continue scaling to AGI. The fundamental architecture of these systems hasn't changed since ~2019.
But this flipped late summer. AlphaProof and o1 are evidence of this new reality. All frontier AI systems are now incorporating components beyond pure deep learning like program synthesis and program search.
I believe ARC Prize played a role here too. All the winners this year are leveraging new AGI reasoning approaches like deep-learning guided program synthesis, and test-time training/fine-tuning. We'll be seeing a lot more of these in frontier AI systems in coming years.
And I'm proud to say that all the code and papers from this year's winners are now open source!
We're going to keep running this thing annually until it's defeated. And we've got ARC-AGI-2 in the works to improve on several of the v1 flaws (more here: https://arcprize.org/blog/arc-prize-2024-winners-technical-r...)
The ARC-AGI community keeps surprising me. From initial launch, through o1 testing, to the final 48 hours when the winning team jumped 10% and both winning papers dropped out of nowhere. I'm incredibly grateful to everyone and we will do our best to steward this attention towards AGI.
We'll be back in 2025!
We still have a long way to go for the grand prize -- we'll be back next year. Also got some new stuff in the works for 2025.
Watch for the official ARC Prize 2024 paper coming Dec 6. It will give an overview of all the new AI reasoning code and approaches open sourced via the competition [3].
[1] https://deepmind.google/discover/blog/ai-solves-imo-problems...
I don’t think that’s true though, it’s hard to be more fair and explicit than:
> OpenAI o1-preview and o1-mini both outperform GPT-4o on the ARC-AGI public evaluation dataset. o1-preview is about on par with Anthropic's Claude 3.5 Sonnet in terms of accuracy but takes about 10X longer to achieve similar results to Sonnet.
I.e., it’s just not that great, and it’s enormously slow.
That probably wasn’t what people wanted to hear, even if it is literally what the results show.
You can't run away from the numbers:
> It took 70 hours on the 400 public tasks compared to only 30 minutes for GPT-4o and Claude 3.5 Sonnet.
(Side note: readers may be getting confused about what “test-time scaling” is, and why it's important. TL;DR: spending more compute at inference time is getting better results. That's a big deal, because previously, throwing more compute at inference didn't seem to make much real difference; but overall I don't see how anything you've said is either inaccurate or misleading.)
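To make that concrete: the simplest well-known way to spend extra inference compute is to sample many candidate answers and keep the most common one (self-consistency). The sketch below only illustrates that general pattern; it is not a claim about how o1 implements its reasoning, and `ask_model` is a placeholder for whatever sampling LLM call you use.

```python
from collections import Counter

def ask_model(prompt: str, temperature: float = 0.8) -> str:
    # Placeholder: plug in your own LLM client here (sampling enabled).
    raise NotImplementedError

def solve_with_more_compute(prompt: str, n_samples: int = 32) -> str:
    # Spend n_samples times the compute of a single call, then majority-vote.
    answers = [ask_model(prompt).strip() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```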
Curiosity is the first step towards new ideas.
ARC Prize's whole goal is to inspire curiosity like this and to encourage more AI researchers to explore and openly share new approaches towards AGI.
So, how well might o1 do with Greenblatt's strategy?
https://chatgpt.com/share/66e4b209-8d98-8011-a0c7-b354a68fab...
Anyways, I'm not trying to make any grand claims about AGI in general, or about ARC-AGI as a benchmark, but I do think that o1 is a leap towards LLM-based solutions to ARC.
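For context, Greenblatt's GPT-4o approach sampled thousands of candidate Python programs per task, kept the ones that reproduced every demonstration pair, and applied a survivor to the test input. Below is a rough sketch of that loop under my own assumptions (a placeholder `sample_program` for the LLM call, an illustrative candidate count), not his actual code.

```python
def sample_program(task_description: str) -> str:
    # Placeholder: prompt the model to write a `transform(grid)` function.
    raise NotImplementedError

def run_candidate(source: str, grid):
    # Execute a candidate program and apply its transform to a grid.
    namespace = {}
    exec(source, namespace)
    return namespace["transform"](grid)

def greenblatt_style_solve(task, task_description: str, n_candidates: int = 1000):
    for _ in range(n_candidates):
        src = sample_program(task_description)
        try:
            # Keep a program only if it matches every demonstration output.
            if all(run_candidate(src, p["input"]) == p["output"]
                   for p in task["train"]):
                return run_candidate(src, task["test"][0]["input"])
        except Exception:
            continue  # most sampled programs crash or are wrong; skip them
    return None
```

A stronger sampler like o1 should need fewer candidates to find a program that fits the demonstrations, which is why I think it's a leap for this style of solution.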