Like replacing named concepts with nonsense words in reasoning benchmarks.
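For concreteness, here is a minimal sketch of that kind of substitution (the names and toy item below are hypothetical, not taken from any actual benchmark or harness):

```python
# Sketch: rewrite a benchmark item so each named concept becomes a stable
# nonsense word, stripping away any memorized association with the originals.
import random
import re

def nonsense_token(rng: random.Random, length: int = 6) -> str:
    """Build a pronounceable nonsense word by alternating consonants and vowels."""
    consonants = "bdfgklmnprstvz"
    vowels = "aeiou"
    return "".join(
        rng.choice(consonants if i % 2 == 0 else vowels) for i in range(length)
    )

def anonymize(text: str, concepts: list[str], seed: int = 0) -> str:
    """Replace each named concept with its own nonsense token (seeded, so the
    same concept maps to the same token across the whole benchmark)."""
    rng = random.Random(seed)
    mapping = {c: nonsense_token(rng) for c in concepts}
    for concept, token in mapping.items():
        text = re.sub(rf"\b{re.escape(concept)}\b", token, text)
    return text

item = "All sparrows are birds. All birds can fly. Can sparrows fly?"
print(anonymize(item, ["sparrows", "birds", "fly"]))
# -> e.g. "All bazeki are dufola. All dufola can gimepa. Can bazeki gimepa?"
```

The logical structure of the item is untouched; only the surface vocabulary changes, so a correct answer has to come from reasoning rather than recall.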
I would want to hear more detail about prompts, frameworks, thinking time, etc., but those don't matter too much. The main caveat is that this was probably run on the public test set, so it could have appeared in pretraining data, and there could even have been some ARC-focussed post-training - I think we don't know yet, and might never know.
But for any reasonable setup, assuming no egregious cheating, that is an amazing score on ARC 2.
Also notable is which models they include for comparison: Gemini 2.5 Pro, Claude Sonnet 4.5, and GPT-5.1. That seems like a minor snub of Grok 4 / Grok 4.1.
And actually, if I do have a problem, it's quite the opposite of what you're suggesting: I'd like us to give more weight to the lived experience of others, even in other contexts and on other subject matter.
> (for) most research projects, it would not help to have input from the general public. In fact, it would just be time-consuming, because error checking... [0]
Since frontier LLMs make clumsy mistakes, they may fall into this category of 'error-prone' mathematicians whose net contributions are actually negative, despite being impressive some of the time.
[0] https://www.youtube.com/watch?v=HUkBz-cdB-k&t=2h59m33s