Prompt: "Let's try a reasoning test. Estimate how many pianos there are at the bottom of the sea."
I tried this on three advanced AIs* and they all choked on it without further hints from me. Claude then said:
Roughly 3 million shipwrecks on ocean floors globally
Maybe 1 in 1000 ships historically carried a piano (passenger ships, luxury vessels)
So ~3,000 ships with pianos sunk
Average maybe 0.5 pianos per ship (not all passenger areas had them)
Estimate: ~1,500 pianos
*Claude Sonnet 4, Google Gemini 2.5 and GPT-4o

Combining our estimates:
From Shipwrecks: 12,500
From Dumping: 1,000
From Catastrophes: 500
Total Estimated Pianos at the Bottom of the Sea ≈ 14,000
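For reference, both estimates boil down to a couple of multiplications and a sum; here is a minimal sketch of the arithmetic, where every input is just an assumption the models made, not a measured figure:

```typescript
// Claude's estimate: shipwrecks that went down carrying a piano.
const wrecks = 3_000_000;          // assumed shipwrecks on the ocean floor
const shareWithPiano = 1 / 1000;   // assumed fraction of ships that carried a piano
const pianosPerShip = 0.5;         // assumed average pianos per such ship
const claudeEstimate = wrecks * shareWithPiano * pianosPerShip; // 1,500

// The second estimate: sum over independent sources.
const fromShipwrecks = 12_500;
const fromDumping = 1_000;
const fromCatastrophes = 500;
const combinedEstimate = fromShipwrecks + fromDumping + fromCatastrophes; // 14,000

console.log({ claudeEstimate, combinedEstimate });
```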
Also I have to point out that 4o isn't a reasoning model and neither is Sonnet 4, unless thinking mode was enabled.
I don't get this argument. The paper is about "whether RLLMs can think". If we grant "humans make these mistakes too", but also "we still require this ability in our definition of thinking", aren't we saying "thinking in humans is an illusion" too?
I.e. to what extent are LLMs able to reliably make use of writing code or using logic systems, and to what extent does hallucinating or providing faulty answers in the absence of such tool access demonstrate an inability to truly reason (I’d expect a smart human to just say “that’s too much” or “that’s beyond my abilities” rather than give a best-effort but faulty answer)?
It's an especially weird argument considering that LLMs are already ahead of humans on Tower of Hanoi. I bet the average person would not be able to "one-shot" you the moves for an 8-disk Tower of Hanoi without writing anything down or tracking the state with actual disks. LLMs have far bigger obstacles to reaching AGI, though.
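For a sense of scale, the standard recursive solution (a generic sketch in TypeScript, not anything from the paper or the models' transcripts) emits 2^n − 1 moves, so an 8-disk tower means 255 moves that all have to be correct in sequence:

```typescript
// Classic recursive Tower of Hanoi: move n disks from `from` to `to` via `spare`.
function hanoi(n: number, from: string, to: string, spare: string, moves: string[] = []): string[] {
  if (n === 0) return moves;
  hanoi(n - 1, from, spare, to, moves); // clear the top n-1 disks out of the way
  moves.push(`${from} -> ${to}`);       // move the largest remaining disk
  hanoi(n - 1, spare, to, from, moves); // stack the n-1 disks back on top
  return moves;
}

const moves = hanoi(8, "A", "C", "B");
console.log(moves.length); // 255 (2^8 - 1)
```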
Point 5 is also a massive strawman with the "not see how well it could use preexisting code retrieved from the web" bit, given that these models will write code to solve these kinds of problems even if you come up with some new problem that wouldn't exist in their training data.
Most of these are just valid issues with the paper. They're not supposed to be arguments that invalidate everything the paper said. The paper didn't really even make any bold claims; it only concluded that LLMs have limitations in their reasoning. It had a catchy title and many people didn't read past that.
Electron comes out looking competitive at runtime! IMO people over-fixate on disk space instead of runtime memory usage.
Memory Usage with a single window open (Release builds)
Windows (x64): 1. Electron: ≈93MB 2. NodeGui: ≈116MB 3. NW.JS: ≈131MB 4. Tauri: ≈154MB 5. Wails: ≈163MB 6. Neutralino: ≈282MB
macOS (arm64): 1. NodeGui: ≈84MB 2. Wails: ≈85MB 3. Tauri: ≈86MB 4. Neutralino: ≈109MB 5. Electron: ≈121MB 6. NW.JS: ≈189MB
Linux (x64): 1. Tauri: ≈16MB 2. Electron: ≈70MB 3. Wails: ≈86MB 4. NodeGui: ≈109MB 5. NW.JS: ≈166MB 6. Neutralino: ≈402MB
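If anyone wants to spot-check figures like these on Linux, reading VmRSS from /proc is a quick approximation. This is my own sketch, not the methodology behind the numbers above, and multi-process apps like Electron need their renderer and GPU processes summed as well:

```typescript
// Rough single-process RSS check on Linux: reads VmRSS from /proc/<pid>/status.
// Multi-process apps (Electron spawns renderer/GPU processes) need the whole
// process tree summed, so treat this as a spot check, not a benchmark.
import { readFileSync } from "node:fs";

function rssMiB(pid: number): number {
  const status = readFileSync(`/proc/${pid}/status`, "utf8");
  const match = status.match(/^VmRSS:\s+(\d+)\s+kB/m);
  if (!match) throw new Error(`VmRSS not found for pid ${pid}`);
  return Number(match[1]) / 1024; // value is reported in kB
}

// Usage: node rss.js <pid>
console.log(`${rssMiB(Number(process.argv[2])).toFixed(1)} MiB`);
```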
The security posture for the code running in the browser is very different from the code running on a trusted backend.
A separation of concerns allows one to have two codebases: a frontend (untrusted but with limited access) and a backend (trusted but with a lot of access).
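A minimal sketch of that split, with a made-up route and fields purely for illustration: the trusted backend re-validates and authorizes everything, no matter what the untrusted frontend already checked.

```typescript
// Hypothetical Express-style backend route: never rely on frontend validation alone.
import express from "express";

const app = express();
app.use(express.json());

app.post("/api/transfer", (req, res) => {
  // Re-validate on the trusted side, even if the frontend form already validated this.
  const { amount, toAccount } = req.body ?? {};
  if (typeof amount !== "number" || amount <= 0 || typeof toAccount !== "string") {
    return res.status(400).json({ error: "invalid request" });
  }
  // Authorization must also happen here, based on server-side session state,
  // not on anything the client claims about itself in the body or headers.
  // ... perform the transfer ...
  res.json({ ok: true });
});

app.listen(3000);
```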
2025-02-27T06:03Z: Disclosure to Next.js team via GitHub private vulnerability reporting
2025-03-14T17:13Z: Next.js team started triaging the report
https://zeropath.com/blog/nextjs-middleware-cve-2025-29927-a...
This looks trivially easy to bypass.
More generally, the entire concept of using middleware which communicates using the same mechanism that is also used for untrusted user input seems pretty wild to me. It divorces the place you need to write code for user request validation (as soon as the user request arrives) from the middleware itself.
Allowing ANY headers from the user, rather than only a whitelisted subset, also seems like an accident waiting to happen (a sketch of filtering internal headers follows below). I think the mindset of ignoring unknown/invalid parts of a request as long as some of it is valid also plays a role.
The framework providing crutches for bad server design is also a consequence of this mindset - are there any concrete use cases where the flow for processing a request should not be a DAG? Allowing recursive requests across authentication boundaries seems like a problem waiting to happen as well.
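To make the header point concrete: CVE-2025-29927 hinges on the internal x-middleware-subrequest header being accepted from outside. Here is a minimal sketch of dropping it where requests enter the trusted zone; note this is a denylist of known-internal markers rather than the stricter whitelist argued for above, and the handler shape is illustrative, not the official Next.js fix.

```typescript
// Hypothetical edge/reverse-proxy handler: strip internal control headers
// before the request reaches the application or its middleware.
import http from "node:http";

const INTERNAL_HEADERS = new Set([
  "x-middleware-subrequest", // the header abused in CVE-2025-29927
]);

const server = http.createServer((req, res) => {
  for (const name of Object.keys(req.headers)) {
    if (INTERNAL_HEADERS.has(name.toLowerCase())) {
      delete req.headers[name]; // never trust internal markers supplied by the client
    }
  }
  // ...forward the sanitized request to the app / next middleware...
  res.writeHead(200, { "content-type": "text/plain" });
  res.end("ok");
});

server.listen(8080);
```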
Nobody can really give you sources for RabbitMQ being able to do it if you don't say what it supposedly cannot do. The way you described it, you simply read data, did something with it, and passed it on somewhere else. RabbitMQ can obviously do that.
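For what it's worth, that read-transform-forward pattern is a few lines with a standard client. A minimal sketch using amqplib, where the queue names and the transform are placeholders:

```typescript
// Minimal consume -> transform -> publish pipeline with RabbitMQ (amqplib).
import amqp from "amqplib";

async function main() {
  const conn = await amqp.connect("amqp://localhost");
  const ch = await conn.createChannel();
  await ch.assertQueue("input");
  await ch.assertQueue("output");

  await ch.consume("input", (msg) => {
    if (!msg) return;
    const data = JSON.parse(msg.content.toString());
    const transformed = { ...data, processedAt: Date.now() }; // "do something with the data"
    ch.sendToQueue("output", Buffer.from(JSON.stringify(transformed)));
    ch.ack(msg); // acknowledge only after the result has been handed off
  });
}

main().catch(console.error);
```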
You make a good point though that the question of whether LLMs reason or not should not be conflated with the question of whether they're on the pathway to AGI or not.