Readit News
jmmcd commented on Gemini 3   blog.google/products/gemi... · Posted by u/preek
nomilk · a month ago
Terence Tao claims [0] that contributions by the public are counter-productive, since the energy required to check a contribution outweighs its benefit:

> (for) most research projects, it would not help to have input from the general public. In fact, it would just be time-consuming, because error checking

Since frontier LLMs make clumsy mistakes, they may fall into this category of 'error-prone' mathematician whose net contributions are actually negative, despite being impressive some of the time.

[0] https://www.youtube.com/watch?v=HUkBz-cdB-k&t=2h59m33s

jmmcd · a month ago
But he actually uses frontier LLMs in his own work. That's probably stronger evidence.
jmmcd commented on Gemini 3   blog.google/products/gemi... · Posted by u/preek
BoorishBears · a month ago
The gold standard for cheating on a benchmark is SFT and ignoring memorization. That's why the standard for quickly testing for benchmark contamination has always been to switch out specifics of the task.

Like replacing named concepts with nonsense words in reasoning benchmarks.

jmmcd · a month ago
Yes. But "the gold standard" just means "the most natural, easy and dumb way".
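The contamination check described above can be sketched in a few lines. This is a hypothetical illustration: the concept/nonsense mapping and the example task are invented, not from any real benchmark.

```python
import re

def swap_concepts(task_text, mapping):
    """Replace each named concept (whole-word match) with a nonsense token.

    A model that genuinely reasons about the task should score about the
    same on the rewritten version; a model that memorized the benchmark
    text should degrade sharply.
    """
    for concept, nonsense in mapping.items():
        task_text = re.sub(rf"\b{re.escape(concept)}\b", nonsense, task_text)
    return task_text

# Invented example mapping and task:
mapping = {"gravity": "florbium", "Newton": "Zarqel"}
original = "Newton showed that gravity pulls the apple toward Earth."
rewritten = swap_concepts(original, mapping)
# "Zarqel showed that florbium pulls the apple toward Earth."
```

Word-boundary matching (`\b`) keeps the substitution from mangling words that merely contain a concept name as a substring.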
jmmcd commented on Gemini 3   blog.google/products/gemi... · Posted by u/preek
burkaman · a month ago
Simon says if he gets a suspiciously good result he'll just try a bunch of other absurd animal/vehicle combinations to see if they trained a special case: https://simonwillison.net/2025/Nov/13/training-for-pelicans-...
jmmcd · a month ago
"Pelican on bicycle" is one special case, but the problem (and the interesting point) is that with LLMs, they are always generalising. If a lab focussed specially on pelicans on bicycles, they would as a by-product improve performance on, say, tigers on rollercoasters. This is new and counter-intuitive to most ML/AI people.
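The probing idea described above can be sketched as follows. This is a minimal illustration, not Simon Willison's actual procedure; the animal and vehicle lists are invented.

```python
from itertools import product

# Invented probe lists; the point is breadth, not these specific items.
animals = ["pelican", "tiger", "octopus", "walrus"]
vehicles = ["bicycle", "rollercoaster", "unicycle", "kayak"]

def probe_prompts(animals, vehicles):
    """Generate one SVG-drawing prompt per animal/vehicle pair.

    If a lab special-cased the famous pair, its score on that pair
    should stand out from the rest of the grid; if the capability is
    general, quality should be roughly uniform across pairs.
    """
    return [f"Generate an SVG of a {a} riding a {v}"
            for a, v in product(animals, vehicles)]

prompts = probe_prompts(animals, vehicles)  # 4 x 4 = 16 prompts
```

Scoring the model's outputs across the whole grid, rather than on the one famous prompt, is what separates a trained special case from general ability.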
jmmcd commented on Gemini 3 Pro Model Card [pdf]   storage.googleapis.com/de... · Posted by u/virgildotcodes
TheAceOfHearts · a month ago
They scored 31.1% on ARC-AGI-2, which puts them in first place.

Also notable which models they include for comparison: Gemini 2.5 Pro, Claude Sonnet 4.5, and GPT-5.1. That seems like a minor snub against Grok 4 / Grok 4.1.

jmmcd · a month ago
About ARC 2:

I would want to hear more detail about prompts, frameworks, thinking time, etc., but they don't matter too much. The main caveat would be that this is probably on the public test set, so could be in pretraining, and there could even be some ARC-focussed post-training - I think we don't know yet and might never know.

But for any reasonable setup, if no egregious cheating, that is an amazing score on ARC 2.

jmmcd commented on Altered states of consciousness induced by breathwork accompanied by music   journals.plos.org/plosone... · Posted by u/gnabgib
AlecSchueler · 4 months ago
I have no problem with the subject matter and routinely hack my own firmware; I'm just clarifying the point that you seemed to miss. This thread is full of the kind of anecdotal evidence that would be laughed out of the room on any other day. That's not a judgement; it's just a fact.

And actually, if I do have a problem it's quite the opposite of what you're suggesting: I'd like us to give more weight to the lived experience of others even in other contexts and regarding other subject matters.

jmmcd · 4 months ago
On HN it's very common to see a blog post along the lines of "I found this old piece of equipment with no brand name, I used some network traffic inspection to figure out what it does, I hacked around a bit, I got it working and turned it into a self-ringing doorbell with wifi" (or whatever). All of that is anecdotal, N=1, "I did what worked for me, I hope it's interesting to you". And those posts are highly prized and rightly so.
jmmcd commented on Quantitative AI progress needs accurate and transparent evaluation   mathstodon.xyz/@tao/11491... · Posted by u/bertman
ipnon · 5 months ago
Tao’s commentary is more practical and insightful than all of the “rationalist” doomers put together.
jmmcd · 5 months ago
(a) no it's not

(b) your comment is miles off-topic, as he is not addressing doom in any sense

jmmcd commented on 'Positive review only': Researchers hide AI prompts in papers   asia.nikkei.com/Business/... · Posted by u/ohjeez
seydor · 6 months ago
These were preprints that have not been reviewed or published
jmmcd · 6 months ago
But they're submissions to ICML.
jmmcd commented on The ‘white-collar bloodbath’ is all part of the AI hype machine   cnn.com/2025/05/30/busine... · Posted by u/lwo32k
jmmcd · 7 months ago
It could certainly replace the author of this article.
jmmcd commented on Trying to teach in the age of the AI homework machine   solarshades.club/p/dispat... · Posted by u/notarobot123
ccppurcell · 7 months ago
If there is another course where students design their own programming language, maybe you could use the best of the previous year's. That way LLMs are unlikely to be able to (easily) produce correct syntax. Just a thought from someone who teaches in a totally different neck of the mathematical/computational woods.
jmmcd · 7 months ago
Modern LLMs can one-shot code in a totally new language, if you provide the language manual. And you have to provide the language manual, because otherwise how can the students learn the language?

u/jmmcd

Karma: 1294 · Cake day: September 30, 2009
About
I'm a lecturer and researcher in computer science and AI in Ireland. I'm into music, computer music, programs, and programs that write programs.