Of course it is quite costly and also requires some "marketing" to actually get it established.
If it is meant to test generalisation capability, then the data the model being evaluated was trained on is crucial to drawing any conclusions.
Look at the construction of this synthetic dataset, for example: https://arxiv.org/pdf/1711.00350
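The interesting part of that construction is that the test set is deliberately *not* an i.i.d. sample of the training data: whole compositional combinations are withheld, so a model can only score well by learning the composition rule rather than memorising the training distribution. A toy sketch of that kind of split (hypothetical commands and grammar, not the actual dataset from the paper):

```python
import random

# Toy grammar: a "command" is a primitive verb plus an optional modifier.
PRIMITIVES = ["walk", "run", "look", "jump"]
MODIFIERS = ["", "twice", "left", "around right"]

def all_commands():
    return [(verb, mod) for verb in PRIMITIVES for mod in MODIFIERS]

def compositional_split(held_out_verb="jump"):
    """Train/test split where one primitive is only seen in isolation.

    The model sees 'jump' on its own during training, but every composed
    form ('jump twice', 'jump left', ...) is reserved for the test set.
    A model that merely fits the training distribution fails; one that
    learns how modifiers compose with verbs generalises.
    """
    train, test = [], []
    for verb, mod in all_commands():
        if verb == held_out_verb and mod != "":
            test.append((verb, mod))
        else:
            train.append((verb, mod))
    random.shuffle(train)
    return train, test

train, test = compositional_split()
print(len(train), "training commands,", len(test), "held-out compositions")
```

The point is that what counts as "generalisation" is defined entirely by how the split is constructed relative to the training data, which is exactly the information we don't have for the models being compared here.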
The ARC HRM blog post says:
> [...] we set out to verify HRM performance against the ARC-AGI-1 Semi-Private dataset - a hidden, hold-out set of ARC tasks used to verify that solutions are not overfit [...] 32% on ARC-AGI-1 is an impressive score with such a small model. A small drop from HRM's claimed Public Evaluation score (41%) to Semi-Private is expected. ARC-AGI-1's Public and Semi-Private sets have not been difficulty calibrated. The observed drop (-9pp) is on the high side of normal variation. If the model had been overfit to the Public set, Semi-Private performance could have collapsed (e.g., ~10% or less). This was not observed.
This reasoning would all go out the window if the model being evaluated can _see_, during training, the type of distribution shift it will encounter at test time. And it's unclear whether the shift in the hidden set is the same as in the public one.
There are also questions about the evaluations raised by comparing large-model performance against these smaller models, especially given the ablation studies. Are the large models trained on the same data as these tiny models? Should they be? And if they shouldn't be, why are we allowing the small models access to that data during training?