bambax · 6 months ago
This article is weak and just general speculation.

Many people doubt the actual performance of Grok 3 and suspect it has been specifically trained on the benchmarks. And Sabine Hossenfelder says this:

> Asked Grok 3 to explain Bell's theorem. It gets it wrong just like all other LLMs I have asked because it just repeats confused stuff that has been written elsewhere rather than looking at the actual theorem.

https://x.com/skdh/status/1892432032644354192

Which shows that "massive scaling", even enormous, gigantic scaling, doesn't improve intelligence one bit; it improves scope, maybe, or flexibility, or coverage, or something, but not "intelligence".

cardanome · 6 months ago
> Sabine Hossenfelder

She really needs to stop commenting on topics outside of theoretical physics.

Even in physics she does not represent the scientific consensus but has some very questionable fringe beliefs like labeling whole sub-fields as "scams to get funding".

She regularly speaks with "scientific authority" about topics she barely knows anything about.

Her video on autism is considered super harmful and misleading by actual autistic people. She also thinks she is an expert on trans issues and climate change. And I doubt she knows enough about artificial intelligence and computer science to comment on LLMs.

Mekoloto · 6 months ago
Your statement is misleading.

She doesn't say she is an expert on trans issues at all! She analyzed the studies, looked at the data, and stated that there is no real "trans pandemic", but highlighted a statistical increase in the numbers among young women, without stating a clear opinion on this finding.

The climate change videos do the same thing. She evaluates the studies and discusses them, making clear that for her certain numbers are too unspecific, and she does not come to a clear conclusion in the sense of climate change yes/no, good/bad.

She is for sure not an expert in all fields, but her way of discussing these topics is based on studies and numbers, and is a worthwhile viewpoint.

The funding scam you mention is a reference to "these people get billions for particle research but the outcome for us as a society is way too small".

dimal · 6 months ago
> Her video on autism is considered super harmful and misleading by actual autistic people.

I’m autistic and I just watched her video. I found it to be one of the best primers on autism I’ve seen. Not complete, of course, and there’s a lot more nuance to it, but very even handed. She doesn’t make any judgements. She just gives the history and talks about the controversies without choosing sides, except to say that the idea of neurodiversity seems reasonable to her. When compared to most of the discourse about autism, it stands up pretty well. Of course, there’s a lot more I want people to know about autism, but it’s an ok start.

Actually, many autistic people (myself included) would find your statement far more harmful. You assume that all autistic people think alike and believe the same thing. This is false. You try to defend us without understanding us.

Don’t do that.

I suppose there’s a possibility that you’re autistic and found it harmful to you. If so, don’t speak for me.

And she was commenting on an AI’s knowledge of Bell’s Inequality, which is PHYSICS. If she can’t comment on that, who can?

dauertewigkeit · 6 months ago
I agree with you that Sabine often talks about matters far outside of her expertise, but as somebody with a foot in academia, I would bet that a very large number of academics have at least one academic research direction in mind that they would categorize as a "scam to get funding".
hitekker · 6 months ago
I was nodding along to your comment, at first. But then I read your follow-ups, which look like you're covering something up that you fear could be true.

I don't know what that something is, so I think I should go listen to Sabine Hossenfelder.

netbioserror · 6 months ago
Based on what I've seen of Sabine, virtually all of this post is lies. She regularly positions herself as an outside skeptic and critic. Do you have any examples of her claiming authority or representing consensus?
me_me_me · 6 months ago
But Bell's theorem IS physics! So, according to you, she absolutely can comment on LLMs' understanding of physics, or the lack of it.

So your whole rant makes no sense.

Der_Einzige · 6 months ago
She's very full of shit and feels a lot like a Lex Fridman for women.

I can't wait for others to call her out further for being herself the biggest grifter of them all, bigger than most of those she tries to take down.

idiotsecant · 6 months ago
Yes, she is the worst type of 'vibes based' science communicator and mainly just says edgy things to improve click rate and drive engagement.
ttoinou · 6 months ago

   Many people doubt the actual performance of Grok 3 and suspect it has been specifically trained on the benchmarks
That's something I've always wondered about; Goodhart's law so obviously applies to each new AI release. Even the fact that writers and journalists don't mention that possibility makes me instantly skeptical about the quality of the article I'm reading.

NitpickLawyer · 6 months ago
> Many people doubt the actual performance of Grok 3 and suspect it has been specifically trained on the benchmarks

2 anecdotes here:

- Just before Grok 2 was released, they put it on LMArena under a pseudonym. If you read the threads (Reddit, X, etc.) when that hit, everyone was raving about the model. People were saying it's the next 4o, that it's so good, hyped, and so on. Then it launched, they revealed the pseudonym, and everyone started shitting on it. There is a lot of bias in this area, especially with anything touching bad spaceman, so take "many people doubt" with a huge grain of salt. People be salty.

- There are benchmarks that seem to correlate very well with end-to-end results on a variety of tasks. LiveBench is one of them. Models scoring highly there have proven to perform well on general tasks and don't feel like they cheated. This is supported by the paper that found models like Phi and Qwen lose ~10-20% of their benchmark scores when checked against newly built, unseen but similar tasks. Models scoring strongly on LiveBench didn't see that big of a gap.

aubanel · 6 months ago
> Which shows that "massive scaling", even enormous, gigantic scaling, doesn't improve intelligence one bit; it improves scope, maybe, or flexibility, or coverage, or something, but not "intelligence".

Do you have any data to support 1. that Grok is not more intelligent than previous models (you gave one anecdotal data point), and 2. that it was trained on more data than other models like o1 and Claude 3.5 Sonnet?

All the data points I have support the opposite: scaling actually increases the intelligence of models. (Agreed, calling this "intelligence" might be a stretch, but alternative definitions like "scope, maybe, or flexibility, or coverage, or something" seem to me like beating around the bush to avoid saying that machines have intelligence.)

Check out the technical report of Llama 3 for instance, with nice figures on how scaling up model training does increase performance on intelligence tests (might as well call that intelligence): https://arxiv.org/abs/2407.21783

melodyogonna · 6 months ago
How can it be specifically trained on benchmarks when it is leading on blind chatbot tests?

The post you quoted doesn't point to a Grok problem if other LLMs are also failing; it seems to me to be a fundamental failure in the current approach to AI model development.

bearjaws · 6 months ago
Any LLM that is uncensored does well on Chatbot tests because a refusal is an automatic loss.

And since 30% of people using chatbots are gooning it up, there's far more refusals...

nycdatasci · 6 months ago
I think a more plausible path to gaming benchmarks would be to use watermarks in text output to identify your model, then unleash bots to consistently rank your model over opponents.
aucisson_masque · 6 months ago
Last time I used Chatbot Arena, I was the one asking the LLMs the questions, so I was effectively making my own benchmark. There weren't any predefined questions.

How could Musk's LLM train on data that does not yet exist?

HenryBemis · 6 months ago
That. I have used only ChatGPT, and I remember asking legacy GPT-4 to write some code. I asked o3 the same question when it came out, and then I compared the code. o3 was 'better': more precise, more detailed, less 'crude'. Now, don't get me wrong, crude worked fine. But when I wanted to do v1.1 and v1.2, o3 nailed it every time, while legacy GPT-4 was simply bad and full of errors.

With that said, I assume that every 'next' version of each engine is using my 'prompts' to train, so each new version has the benefit of having already processed my initial v1.0 and then v1.1 and then v1.2. So it is somewhat 'unfair', because for "ChatGPT v2024" my v1.0 is brand new, while for "ChatGPT v2027" my v1.0, v1.1, and v1.2 are already in the training dataset.

I haven't used Grok yet; perhaps it's time to pause that OpenAI payment, give Elon some $$$, and see how it works 'for me'.

JKCalhoun · 6 months ago
That's true. You can head over to lmarena.ai and pit it against other LLMs yourself. I only tried two prompts but was surprised at how well it did.

There are "leaderboards" there that provide more anecdotal data points than my two.

jiggawatts · 6 months ago
People have called LLMs a "blurry picture of the Internet". Improving the focus won't change the subject of the picture; it just makes it sharper. Every photographer knows this!

A fundamentally new approach is needed, such as training AIs in phases: instead of merely training them to parrot their inputs, a first model is used to critique and analyse the inputs, that annotated data is used to train another model in a second pass, which is used to critique the data again, and so on, probably for half a dozen or more iterations. On each round, the model can learn not just what it heard, but also an analysis of its veracity, validity, and consistency.

Notably, something akin to this was done for training DeepSeek, but only in a limited fashion.
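
A minimal sketch of what that critique-then-retrain loop could look like (purely illustrative; Model, analyse, and train are placeholder names, not anything any lab has published):

    from dataclasses import dataclass
    from typing import Callable, List, Tuple

    Document = str
    Critique = str

    @dataclass
    class Model:
        # analyse() stands in for the model judging a document's
        # veracity, validity, and internal consistency.
        analyse: Callable[[Document], Critique]

    def bootstrap(model: Model,
                  corpus: List[Document],
                  train: Callable[[List[Tuple[Document, Critique]]], Model],
                  rounds: int = 6) -> Model:
        for _ in range(rounds):
            # Each successive model is trained on the documents *plus*
            # the previous model's analysis of them, not on raw text alone.
            annotated = [(doc, model.analyse(doc)) for doc in corpus]
            model = train(annotated)
        return model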

BiteCode_dev · 6 months ago
It is very up to date, however. I asked it about recent stuff in Python packaging, and it gets it while others don't.
bccdee · 6 months ago
The creation of a model which is "co-state-of-the-art" (assuming it wasn't trained on the benchmarks directly) is not a win for scaling laws. I could just as easily claim that xAI's failure to significantly outperform existing models, despite "throwing more compute at Grok 3 than even OpenAI could", is further evidence that hyper-scaling is a dead end which will only yield incremental improvements.

Obviously more computing power makes the computer better. That is a completely banal observation. The rest of this 2000-word article is groping around for a way to take an insight based on the difference between '70s symbolic AI and the neural networks of the 2010s and apply it to the difference between GPT-4 and Grok 3 off the back of a single set of benchmarks. It's a bad article.

starspangled · 6 months ago
> The creation of a model which is "co-state-of-the-art" (assuming it wasn't trained on the benchmarks directly) is not a win for scaling laws.

Just based on the comparisons linked in the article, it's not "co-state-of-the-art", it's the clear leader. You might argue those numbers are wrong or not representative, but you can't accept them and then claim it's not outperforming existing models.

bccdee · 6 months ago
The leader, perhaps, but not by a large margin, and only on these sample benchmarks. "Co-state-of-the-art" is the term used in the article, and I'm going to take that at face value.
horsawlarway · 6 months ago
I agree.

There's a lot of attention being paid to metrics that often don't align all that well with actual production use cases, and frankly the metrics are good but hardly breathtaking.

They have an absolutely insane outlay of additional compute, which appears to have given them a relatively paltry increase in capabilities.

15 times the compute for 5-15% better performance is basically the exact opposite of the bitter lesson.

Hell - it genuinely seems like the author didn't even read the actual bitter lesson.

The lesson is not "scale always wins"; the lesson was "We have to learn the bitter lesson that building in how we think we think does not work in the long run."

And somewhat ironically - the latest advances seem to genuinely undermine the lesson. It turns out that building in reasoning/thinking (a heuristic that copies human behavior) is the biggest performance jump we've seen in the last year.

Does that mean we won't scale out of the current capabilities? No, we definitely might. But we also definitely might not.

The diminishing returns we're seeing for scale hint strongly that just throwing more compute at the problem is not enough by itself. Possibly still required, but definitely not sufficient.

smy20011 · 6 months ago
Did they? Deepseek spent about 17 months achieving SOTA results with a significantly smaller budget. While xAI's model isn't a substantial leap beyond Deepseek R1, it utilizes 100 times more compute.

If you had $3 billion, xAI would choose to invest $2.5 billion in GPUs and $0.5 billion in talent. DeepSeek would invest $1 billion in GPUs and $2 billion in talent.

I would argue that the latter approach (Deepseek's) is more scalable. It's extremely difficult to increase compute by 100 times, but with sufficient investment in talent, achieving a 10x increase in compute is more feasible.

mike_hearn · 6 months ago
We don't actually know how much money DeepSeek spent or how much compute they used. The numbers being thrown around are suspect; the paper they published didn't reveal the costs of all models, nor the R&D cost it took to develop them.

In any AI R&D operation the bulk of the compute goes on doing experiments, not on the final training run for whatever models they choose to make available.

wallaBBB · 6 months ago
One thing I (intuitively) don't doubt: that they spent less money developing R1 than OpenAI spent on marketing, lobbying, and management compensation.
tw1984 · 6 months ago
> The numbers being thrown around are suspect, the paper they published didn't reveal the costs of all models nor the R&D cost it took to develop them.

Did any lab release such a figure? It would be interesting to see.

sigmoid10 · 6 months ago
>It's extremely difficult to increase compute by 100 times, but with sufficient investment in talent, achieving a 10x increase in compute is more feasible.

The article explains how in reality the opposite is true. Especially when you look at it long term. Compute power grows exponentially, humans do not.

llm_trw · 6 months ago
If the bitter lesson were true, we'd be getting SOTA results out of two-layer neural networks using tanh as the activation function.

It's a lazy blog post that should be thrown out after a minute of thought by anyone in the field.

OtherShrezzing · 6 months ago
Humans don't grow exponentially indefinitely. But there are only something on the order of 100k AI researchers employed in the big labs right now. Meanwhile, there are around 20 million software engineers globally, and around 200k math graduates per year.

The number of humans who could feasibly work on this problem is pretty high, and the labs could grow an order of magnitude, and still only be tapping into the top 1-2% of engineers & mathematicians. They could grow two orders of magnitude before they've absorbed all of the above-average engineers & mathematicians in the world.

smy20011 · 6 months ago
Humans do write code that scales with compute.

The performance is always raw performance * software efficiency. You can use shitty software and waste all these FLOPs.

alecco · 6 months ago
Algorithmic improvements in new fields are often bigger than hardware improvements.
stpedgwdgfhgdd · 6 months ago
Large teams are very hard to scale.

There is a reason why startups innovate and large companies follow.

mirekrusin · 6 months ago
DeepSeek's innovations are applicable to xAI's setup; the results are simply a multiple of their compute scale.

DeepSeek didn't have option A or B available; they only had the extreme-optimisation option to work with.

It’s weird that people present those two approaches as mutually exclusive ones.

PeterStuer · 6 months ago
It's not an either/or. Your hiring of talent is only limited by your GPU spend if you can't hire because you ran out of money.

In reality pushing the frontier on datacenters will tend to attract the best talent, not turn them away.

And in talent, it is the quality rather than the quantity that counts.

A 10x breakthrough in algorithms will compound with a 10x scale-out in compute, not hinder it.

I am a big fan of Deepseek, Meta and other open model groups. I also admire what the Grok team is doing, especially their astounding execution velocity.

And it seems like Grok 2 is scheduled to be opened as promised.

krainboltgreene · 6 months ago
Have fun hiring any talent after three years of advertising to students that all programming/data jobs are going to be obsolete.
smy20011 · 6 months ago
Not that simple. It could cause a resource curse [1] for developers. Why optimize algorithms when you have nearly infinite resources? For DeepSeek, their constraints are one of the reasons they achieved a breakthrough. One of their contributions, FP8 training, was finding a way to train models on GPUs whose FP32 performance is limited due to export controls.

[1]: https://www.investopedia.com/terms/r/resource-curse.asp#:~:t...

SamPatt · 6 months ago
R1 came out when Grok 3's training was still ongoing. They shared their techniques freely, so you would expect the next round of models to incorporate as many of those techniques as possible. The bump you would get from the extra compute occurs in the next cycle.

If Musk really can get 1 million GPUs and they incorporate some algorithmic improvements, it'll be exciting to see what comes out.

dogma1138 · 6 months ago
Deepseek didn’t seem to invest in talent as much as it did in smuggling restricted GPUs into China via 3rd countries.

Also, not for nothing, scaling compute 100x or even 1000x is much easier than scaling talent 10x or even 2x, since you don't need workers, you need discovery.

tw1984 · 6 months ago
Talent is not something you can just freely pick up from your local Walmart.
wordofx · 6 months ago
DeepSeek was a crypto mining operation before they pivoted to AI. They have an insane amount of GPUs lying around. So we have no idea how much compute they have compared to xAI.
oskarkk · 6 months ago
Do you have any sources for that? When I searched "DeepSeek crypto mining" the first result was your comment, the other results were just about the wide tech market selloff after DeepSeek appeared (that also affected crypto). As far as I know, they had many GPUs because their parent company was using AI algorithms for trading for many years.

https://en.wikipedia.org/wiki/High-Flyer

miki123211 · 6 months ago
Crypto GPUs have nothing to do with AI GPUs.

Crypto mining is an embarrassingly parallel problem, requiring little to no communication between GPUs. To a first approximation, in crypto, 10x-ing the number of "cores" per GPU, 10x-ing the number of GPUs per rig, and 10x-ing the number of rigs you own are basically equivalent. An infinite number of extremely slow GPUs would do just as well as one infinitely fast GPU. This is why consumer GPUs are great for crypto.

AI is the opposite. In AI, you need extremely fast communication between GPUs. This means getting as much memory per GPU as possible (to make communication less necessary), and putting all the GPUs in one datacenter.

Consumer GPUs, which were used for crypto, don't support the fast interconnect technologies needed for AI training, and they don't come in the 80 GB memory versions that AI labs need. This is Nvidia's price differentiation strategy.
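
As a rough back-of-envelope illustration (the model size, precision, and link speeds below are assumptions picked for round numbers, not measurements of any real cluster):

    # Data-parallel training: every optimizer step, each GPU exchanges a full
    # set of gradients; a ring all-reduce moves roughly 2x the gradient bytes.
    params = 70e9              # assume a 70B-parameter model
    bytes_per_grad = 2         # bf16 gradients
    traffic_per_step = 2 * params * bytes_per_grad   # ~280 GB per GPU per step

    nvlink = 900e9             # ~900 GB/s NVLink-class links inside a node
    commodity = 10e9           # ~10 GB/s, typical of an ex-mining rig's networking

    print(traffic_per_step / nvlink)     # ~0.3 s of communication per step
    print(traffic_per_step / commodity)  # ~28 s per step: communication dominates

    # Crypto mining needs none of this: each GPU grinds hashes independently,
    # so slow links between cheap consumer cards cost essentially nothing.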

oskarkk · 6 months ago
> While xAI's model isn't a substantial leap beyond Deepseek R1, it utilizes 100 times more compute.

I'm not sure if it's close to 100x more. xAI had 100K Nvidia H100s, while this is what SemiAnalysis writes about DeepSeek:

> We believe they have access to around 50,000 Hopper GPUs, which is not the same as 50,000 H100, as some have claimed. There are different variations of the H100 that Nvidia made in compliance to different regulations (H800, H20), with only the H20 being currently available to Chinese model providers today. Note that H800s have the same computational power as H100s, but lower network bandwidth.

> We believe DeepSeek has access to around 10,000 of these H800s and about 10,000 H100s. Furthermore they have orders for many more H20’s, with Nvidia having produced over 1 million of the China specific GPU in the last 9 months. These GPUs are shared between High-Flyer and DeepSeek and geographically distributed to an extent. They are used for trading, inference, training, and research. For more specific detailed analysis, please refer to our Accelerator Model.

> Our analysis shows that the total server CapEx for DeepSeek is ~$1.6B, with a considerable cost of $944M associated with operating such clusters. Similarly, all AI Labs and Hyperscalers have many more GPUs for various tasks including research and training than they commit to an individual training run due to centralization of resources being a challenge. X.AI is unique as an AI lab with all their GPUs in 1 location.

https://semianalysis.com/2025/01/31/deepseek-debates/

I don't know how much slower the GPUs they have are, but if they have 50K of them, that doesn't sound like 100x less compute to me. Also, a company that has N GPUs and trains AI on them for 2 months can achieve the same results as a company that has 2N GPUs and trains for 1 month. So DeepSeek could spend a longer time training to offset the fact that they have fewer GPUs than competitors.
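
Rough GPU-hours arithmetic to illustrate that last point (the training durations here are made-up assumptions, and this ignores real per-GPU and interconnect differences):

    xai      = 100_000 * 1 * 30 * 24   # 100K H100s training for ~1 month (assumed)
    deepseek =  50_000 * 2 * 30 * 24   # 50K Hopper GPUs training for ~2 months (assumed)
    print(xai / deepseek)              # 1.0 -- the same GPU-hours, nowhere near a 100x gap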

cma · 6 months ago
Having 50K of them isn't the same thing as 50K in one high-bandwidth cluster, right? xAI has all theirs so far in one connected cluster, all homogeneous H100s, right?
_giorgio_ · 6 months ago
Deepseek spent at least 1.5 billion on hardware.
rfoo · 6 months ago
I'm pretty skeptical of that 75% on GPQA Diamond for a non-reasoning model. Hope that xAI can make Grok 3 API available next week so I can run it against some private evaluations to see if it's really this good.

Another nitpick: I don't think DeepSeek had 50k Hopper GPUs. Maybe they have 50k now, after getting the world's attention and having a nationally sponsored grey market backing them, but that 50k number is certainly dreamed up. During the past year DeepSeek's intern recruitment ads always just mentioned "unlimited access to 10k A100s", suggesting that they may have very limited H100/H800s, and most of their research ideas were validated on smaller models on an Ampere cluster. The 10k A100 number matches a cluster their parent hedge fund company announced a few years ago. All in all, my estimate is that they had more (maybe 20k) A100s, and single-digit thousands of H800s.

kgwgk · 6 months ago
> my estimation is they had more (maybe 20k) A100s, and single-digit thousands of H800s.

Their technical report on DeepSeek-V3 says that it "is trained on a cluster equipped with 2048 NVIDIA H800 GPUs." If they had even high-single-digit thousands of H800s they would have probably used more computing power instead of waiting almost two months.

riku_iki · 6 months ago
> I'm pretty skeptical of that 75% on GPQA Diamond for a non-reasoning model.

Could that benchmark simply have leaked into the training data, as many others have?

petesergeant · 6 months ago
The author has come up with their own, _unusual_ definition of the bitter lesson. In fact, as a direct quote, the original is:

> Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation.

e.g.: “the study of linguistics doesn’t help you build an LLM” or “you don’t need to know about chicken physiology to make a vision system that tells you how old a chicken is”

The author then uses a narrow and _unusual_ definition of what computation _means_, by saying it simply means access to fast chips, rather than the work you can perform on them, which would obviously include how efficiently you use them.

In short, this article misuses two terms to more simply say “looks like the scaling laws still work”.

viraptor · 6 months ago
This is a weird takeaway from the recent changes. Right now companies can scale because there's a stupid amount of stupid money flowing into the AI craze, but that's going to end. Companies are already discovering the issues with monetising those systems. Sure, they can "let go" and burn the available cash, but the investors will eventually come knocking. Since everyone figures out similar tech anyway, it's the people with the most experience improving the tech who will be in the best position long term, while OpenAI will be stuck trying to squeeze adverts and monitoring into their chat for cash flow.
az226 · 6 months ago
Until we see progress slowing down, I don’t see venture capital disappearing in the race to ASI and beyond.
podgorniy · 6 months ago
Adjustment: not progress, the hype.

People believe that LLM progress will become the foundation of future economic expansion the same way microelectronics did. But for now there are few signs of that economic benefit from AI/LLM stuff. If one does the math on what productivity increase the tech would have to deliver in order to have a positive ROI, one would be surprised how far reality is from making the investments feasible: https://www.bruegel.org/working-paper/tension-between-explod.... Yes, anecdotally people tell stories of how they can code twice/thrice/ten times faster, or how they automated their whole workflow or replaced support with an LLM. But that's far from enough to make AI investment feasible in existing businesses (AI startups will flourish for a while on venture money). Also anecdotally, there are many failed attempts to replace people with LLMs (like McDonald's ordering, which resulted in crazy orders).

So what we have is hype on top of a belief in progress as a continuous phenomenon. But progress itself has slowed greatly. Where are all the breakthroughs that change our way of living? Pocket computers and consumer electronics (which are not a discovery so much as an optimisation) and the internet (also more about scaling than inventing) were the last. 3D printing, cancer treatment, and robotics were thought to be the next factors. Until AI/LLMs. Now AIs/LLMs are the last resort for believers in progress and techno-optimists like Musk.

nickfromseattle · 6 months ago
Side question, let's say Grok is comparable in intelligence to other leading models. Will any serious business switch their default AI capabilities to Grok?
taf2 · 6 months ago
We might. I was testing it out on some Salesforce Apex code and it was doing a better job than o3-mini-high at getting the job done for me…
resfirestar · 6 months ago
Grok 2's per-token prices are similar to GPT-4o, but since Grok tends to write longer responses than others, it can be significantly more expensive to use depending on the task. If xAI prices Grok 3 to compete with o1, not everyone is going to be lining up to use it even if it's a bit better than the competition. If that's how it goes, I'll be interested in the pricing for Grok 3 mini.
marcuschong · 6 months ago
I wouldn't, for two reasons:

- We've already tested OpenAI's GPT and Gemini a lot. Although they're not deterministic, we have used them enough to trust them and know the pitfalls.

- Elon's example of Grok outputting 'X is the only place for real information' makes the model almost completely useless for text generation. Even more so than DeepSeek.
gusfoo · 6 months ago
> Side question, let's say Grok is comparable in intelligence to other leading models. Will any serious business switch their default AI capabilities to Grok?

Yes, I'd say so. Bear in mind that, outside of the Terminally Online, very few people would deliberately hobble their business by choosing an inferior product.

nickthegreek · 6 months ago
If the API is cheap enough.
ssijak · 6 months ago
Why not?
chank · 6 months ago
Because they already have something that works. Why switch if there's no advantage?
cowpig · 6 months ago
Because people want to protect their businesses?

I don't even think it makes sense to use the closed, API-access models for core business functions. Seems like an existential risk to be handing over that kind of information to a third party in the age of AI.

But handing it over to the guy who is simultaneously running for-profit companies that contract with the government and seizing control of the Treasury? To a guy who has a long history of SEC issues, and who was caught cheating at a video game just to convince everyone he's a genius? That just seems incredibly short-sighted.

GaggiX · 6 months ago
The bitter lesson is about the fact that general methods that leverage computation are ultimately the most effective. Grok 3 is not more general than DeepSeek's or OpenAI's models, so mentioning the bitter lesson here doesn't make much sense; it's just the scaling law.