bambax · 6 months ago
This article is weak and just general speculation.

Many people doubt the actual performance of Grok 3 and suspect it has been specifically trained on the benchmarks. And Sabine Hossenfelder says this:

> Asked Grok 3 to explain Bell's theorem. It gets it wrong just like all other LLMs I have asked because it just repeats confused stuff that has been written elsewhere rather than looking at the actual theorem.

https://x.com/skdh/status/1892432032644354192

Which shows that "massive scaling", even enormous, gigantic scaling, doesn't improve intelligence one bit; it improves scope, maybe, or flexibility, or coverage, or something, but not "intelligence".

cardanome · 6 months ago
> Sabine Hossenfelder

She really needs to stop commenting on topics outside of theoretical physics.

Even in physics she does not represent the scientific consensus but has some very questionable fringe beliefs like labeling whole sub-fields as "scams to get funding".

She regularly speaks with "scientific authority" about topics she barely knows anything about.

Her video on autism is considered super harmful and misleading by actual autistic people. She also thinks she is an expert on trans issues and climate change. And I doubt she knows enough about artificial intelligence and computer science to comment on LLMs.

Mekoloto · 6 months ago
Your statement is misleading.

She doesn't say she is an expert on trans issues at all! She analyzed the studies, looked at the data, and stated that there is no real "trans pandemic", but highlighted a statistical increase in the numbers among young women, without stating a clear opinion on this finding.

The climate change videos do the same thing. She evaluates the studies and discusses them, making clear that for her certain numbers are too unspecific, and she does not come to a clear conclusion in the sense of climate change yes/no, good/bad.

She is for sure not an expert in all fields, but her way of discussing these topics is based on studies and numbers, and is a worthwhile viewpoint.

The funding scam you mention is a reference to "these people get billions for particle research but the outcome for us as a society is way too small".

dimal · 6 months ago
> Her video on autism is considered super harmful and misleading by actual autistic people.

I’m autistic and I just watched her video. I found it to be one of the best primers on autism I’ve seen. Not complete, of course, and there’s a lot more nuance to it, but very even handed. She doesn’t make any judgements. She just gives the history and talks about the controversies without choosing sides, except to say that the idea of neurodiversity seems reasonable to her. When compared to most of the discourse about autism, it stands up pretty well. Of course, there’s a lot more I want people to know about autism, but it’s an ok start.

Actually, many autistic people (myself included) would find your statement far more harmful. You assume that all autistic people think alike and believe the same thing. This is false. You try to defend us without understanding us.

Don’t do that.

I suppose there’s a possibility that you’re autistic and found it harmful to you. If so, don’t speak for me.

And she was commenting on an AI’s knowledge of Bell’s Inequality, which is PHYSICS. If she can’t comment on that, who can?

dauertewigkeit · 6 months ago
I agree with you that Sabine often talks about matters far outside of her expertise, but as somebody with a foot in academia, I would bet that a very large number of academics have at least one academic research direction in mind that they would categorize as a "scam to get funding".
hitekker · 6 months ago
I was nodding along to your comment, at first. But then I read your follow-ups, which look like you're covering something up that you fear could be true.

I don't know what that something is, so I think I should go listen to Sabine Hossenfelder.

netbioserror · 6 months ago
Based on what I've seen of Sabine, virtually all of this post is lies. She regularly positions herself as an outside skeptic and critic. Do you have any examples of her claiming authority or representing consensus?
me_me_me · 6 months ago
But Bell's theorem IS physics! So, according to you, she absolutely can comment on LLMs' understanding of physics, or the lack of it.

So your whole rant makes no sense.

Der_Einzige · 6 months ago
She's very full of shit and feels a lot like a Lex Fridman for women.

I can't wait for others to call her out further for being herself the biggest grifter of them all, bigger than most of those she tries to take down.

idiotsecant · 6 months ago
Yes, she is the worst type of 'vibes based' science communicator and mainly just says edgy things to improve click rate and drive engagement.
ttoinou · 6 months ago

   Many people doubt the actual performance of Grok 3 and suspect it has been specifically trained on the benchmarks
That's something I've always wondered about; Goodhart's law so obviously applies to each new AI release. Even the fact that writers and journalists don't mention that possibility makes me instantly skeptical about the quality of the article I'm reading.

NitpickLawyer · 6 months ago
> Many people doubt the actual performance of Grok 3 and suspect it has been specifically trained on the benchmarks

2 anecdotes here:

- Just before Grok 2 was released, they put it on LMArena under a pseudonym. If you read the threads (Reddit, X, etc.) when that hit, everyone was raving about the model. People were saying it's the next 4o, that it's so good, hyped, and so on. Then it launched, they revealed the pseudonym, and everyone started shitting on it. There is a lot of bias in this area, especially with anything touching bad spaceman, so take "many people doubt" with a huge grain of salt. People be salty.

- There are benchmarks that seem to correlate very well with end-to-end results on a variety of tasks. LiveBench is one of them. Models scoring highly there have proven to perform well on general tasks and don't feel like they cheated. This is supported by the paper that found models like Phi and Qwen lose ~10-20% of their benchmark scores when checked against newly built, unseen but similar tasks. Models scoring strongly on LiveBench didn't see that big of a gap.

aubanel · 6 months ago
> Which shows that "massive scaling", even enormous, gigantic scaling, doesn't improve intelligence one bit; it improves scope, maybe, or flexibility, or coverage, or something, but not "intelligence".

Do you have any data to support 1. that Grok is not more intelligent than previous models (you gave one anecdotal data point), and 2. that it was trained on more data than other models like o1 and Claude 3.5 Sonnet?

All the data points I have support the opposite: scaling actually increases the intelligence of models. (Agreed, calling this "intelligence" might be a stretch, but alternative definitions like "scope, maybe, or flexibility, or coverage, or something" seem to me like beating around the bush to avoid saying that machines have intelligence.)

Check out the technical report of Llama 3 for instance, with nice figures on how scaling up model training does increase performance on intelligence tests (might as well call that intelligence): https://arxiv.org/abs/2407.21783

melodyogonna · 6 months ago
How can it be specifically trained on benchmarks when it is leading on blind chatbot tests?

The post you quoted doesn't point to a Grok problem if other LLMs are also failing; it seems to me to be a fundamental failure in the current approach to AI model development.

bearjaws · 6 months ago
Any LLM that is uncensored does well on Chatbot tests because a refusal is an automatic loss.

And since 30% of people using chatbots are gooning it up, there's far more refusals...

nycdatasci · 6 months ago
I think a more plausible path to gaming benchmarks would be to use watermarks in text output to identify your model, then unleash bots to consistently rank your model over opponents.
aucisson_masque · 6 months ago
Last time I used Chatbot Arena, I was the one asking the LLMs the questions, so I was effectively making my own benchmark. There weren't any predefined questions.

How could Musk's LLM train on data that does not yet exist?

HenryBemis · 6 months ago
That. I have used only ChatGPT, and I remember asking legacy GPT-4 to write some code. I asked o3 the same question when it came out, and then I compared the code. o3 was 'better': more precise, more detailed, less 'crude'. Now, don't get me wrong, crude worked fine. But when I wanted to do v1.1 and v1.2, o3 nailed it every time, while legacy GPT-4 was simply bad and full of errors.

With that said, I assume that every 'next' version of each engine is using my 'prompts' to train, so each new version has the benefit of having already processed my initial v1.0 and then v1.1 and then v1.2. So it is somewhat 'unfair', because for "ChatGPT v2024" my v1.0 is brand new, while for "ChatGPT v2027" my v1.0, v1.1, and v1.2 are already in the training dataset.

I haven't used Grok yet; perhaps it's time to pause that OpenAI payment, give Elon some $$$, and see how it works 'for me'.

JKCalhoun · 6 months ago
That's true. You can head over to lmarena.ai and pit it against other LLMs yourself. I only tried two prompts but was surprised at how well it did.

There are "leaderboards" there that provide more anecdotal data points than my two.

jiggawatts · 6 months ago
People have called LLMs a "blurry picture of the Internet". Improving the focus won't change the subject of the picture; it just makes it sharper. Every photographer knows this!

A fundamentally new approach is needed, such as training AIs in phases: instead of merely training them to parrot their inputs, a first model is used to critique and analyse the inputs, that annotated data is used to train another model in a second pass, which is used to critique the data again, and so on, probably for half a dozen or more iterations. On each round, the model can learn not just what it heard, but also an analysis of its veracity, validity, and consistency.

Notably, something akin to this was done for training DeepSeek, but only in a limited fashion.
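
A minimal sketch of what that critique-then-retrain loop could look like (purely illustrative; Model, analyse, and train are placeholder names, not anything any lab has published):

    from dataclasses import dataclass
    from typing import Callable, List, Tuple

    Document = str
    Critique = str

    @dataclass
    class Model:
        # analyse() stands in for the model judging a document's
        # veracity, validity, and internal consistency.
        analyse: Callable[[Document], Critique]

    def bootstrap(model: Model,
                  corpus: List[Document],
                  train: Callable[[List[Tuple[Document, Critique]]], Model],
                  rounds: int = 6) -> Model:
        for _ in range(rounds):
            # Each successive model is trained on the documents *plus*
            # the previous model's analysis of them, not on raw text alone.
            annotated = [(doc, model.analyse(doc)) for doc in corpus]
            model = train(annotated)
        return model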

BiteCode_dev · 6 months ago
It is very up to date, however. I asked it about recent stuff in Python packaging, and it gets it while others don't.
bccdee · 6 months ago
The creation of a model which is "co-state-of-the-art" (assuming it wasn't trained on the benchmarks directly) is not a win for scaling laws. I could just as easily claim that xAI's failure to significantly outperform existing models, despite "throwing more compute at Grok 3 than even OpenAI could", is further evidence that hyper-scaling is a dead end which will only yield incremental improvements.

Obviously more computing power makes the computer better. That is a completely banal observation. The rest of this 2000-word article is groping around for a way to take an insight based on the difference between '70s symbolic AI and the neural networks of the 2010s and apply it to the difference between GPT-4 and Grok 3 off the back of a single set of benchmarks. It's a bad article.

starspangled · 6 months ago
> The creation of a model which is "co-state-of-the-art" (assuming it wasn't trained on the benchmarks directly) is not a win for scaling laws.

Just based on the comparisons linked in the article, it's not "co-state-of-the-art", it's the clear leader. You might argue those numbers are wrong or not representative, but you can't accept them and then claim it's not outperforming existing models.

bccdee · 6 months ago
The leader, perhaps, but not by a large margin, and only on these sample benchmarks. "Co-state-of-the-art" is the term used in the article, and I'm going to take that at face value.
horsawlarway · 6 months ago
I agree.

There's a lot of attention being paid to metrics that often don't align all that well with actual production use cases, and frankly the metrics are good but hardly breathtaking.

They have an absolutely insane outlay of additional compute, which appears to have given them a relatively paltry increase in capabilities.

15 times the compute for 5-15% better performance is basically the exact opposite of the bitter lesson.

Hell - it genuinely seems like the author didn't even read the actual bitter lesson.

The lesson is not "scale always wins"; the lesson was "We have to learn the bitter lesson that building in how we think we think does not work in the long run."

And somewhat ironically - the latest advances seem to genuinely undermine the lesson. It turns out that building in reasoning/thinking (a heuristic that copies human behavior) is the biggest performance jump we've seen in the last year.

Does that mean we won't scale out of the current capabilities? No, we definitely might. But we also definitely might not.

The diminishing returns we're seeing for scale hint strongly that just throwing more compute at the problem is not enough by itself. Possibly still required, but definitely not sufficient.

smy20011 · 6 months ago
Did they? Deepseek spent about 17 months achieving SOTA results with a significantly smaller budget. While xAI's model isn't a substantial leap beyond Deepseek R1, it utilizes 100 times more compute.

If you had $3 billion, xAI would choose to invest $2.5 billion in GPUs and $0.5 billion in talent. DeepSeek would invest $1 billion in GPUs and $2 billion in talent.

I would argue that the latter approach (Deepseek's) is more scalable. It's extremely difficult to increase compute by 100 times, but with sufficient investment in talent, achieving a 10x increase in compute is more feasible.

mike_hearn · 6 months ago
We don't actually know how much money DeepSeek spent or how much compute they used. The numbers being thrown around are suspect; the paper they published didn't reveal the costs of all models, nor the R&D cost it took to develop them.

In any AI R&D operation the bulk of the compute goes on doing experiments, not on the final training run for whatever models they choose to make available.

wallaBBB · 6 months ago
One thing I (intuitively) don't doubt: that they spent less money developing R1 than OpenAI spent on marketing, lobbying, and management compensation.
tw1984 · 6 months ago
> The numbers being thrown around are suspect, the paper they published didn't reveal the costs of all models nor the R&D cost it took to develop them.

Did any lab release such a figure? It would be interesting to see.

sigmoid10 · 6 months ago
>It's extremely difficult to increase compute by 100 times, but with sufficient investment in talent, achieving a 10x increase in compute is more feasible.

The article explains how in reality the opposite is true. Especially when you look at it long term. Compute power grows exponentially, humans do not.

llm_trw · 6 months ago
If the bitter lesson were true, we'd be getting SOTA results out of two-layer neural networks using tanh as the activation function.

It's a lazy blog post that should be thrown out after a minute of thought by anyone in the field.

OtherShrezzing · 6 months ago
Humans don't grow exponentially indefinitely. But there are only something on the order of 100k AI researchers employed in the big labs right now. Meanwhile, there are around 20 million software engineers globally, and around 200k math graduates per year.

The number of humans who could feasibly work on this problem is pretty high, and the labs could grow an order of magnitude, and still only be tapping into the top 1-2% of engineers & mathematicians. They could grow two orders of magnitude before they've absorbed all of the above-average engineers & mathematicians in the world.

smy20011 · 6 months ago
Humans do write code that scales with compute.

The performance is always raw performance * software efficiency. You can use shitty software and waste all these FLOPs.

alecco · 6 months ago
Algorithmic improvements in new fields are often bigger than hardware improvements.
stpedgwdgfhgdd · 6 months ago
Large teams are very hard to scale.

There is a reason why startups innovate and large companies follow.

mirekrusin · 6 months ago
DeepSeek's innovations are applicable to xAI's setup; the results are simply a multiple of their compute scale.

DeepSeek didn't have option A or B available; they only had the extreme-optimisation option to work with.

It’s weird that people present those two approaches as mutually exclusive ones.

PeterStuer · 6 months ago
It's not an either/or. Your hiring of talent is only limited by your GPU spend if you can't hire because you ran out of money.

In reality pushing the frontier on datacenters will tend to attract the best talent, not turn them away.

And in talent, it is the quality rather than the quantity that counts.

A 10x breakthrough in algorithms will compound with a 10x scale-out in compute, not hinder it.

I am a big fan of Deepseek, Meta and other open model groups. I also admire what the Grok team is doing, especially their astounding execution velocity.

And it seems like Grok 2 is scheduled to be opened as promised.

krainboltgreene · 6 months ago
Have fun hiring any talent after three years of advertising to students that all programming/data jobs are going to be obsolete.
smy20011 · 6 months ago
Not that simple. It could cause a resource curse [1] for developers. Why optimize algorithms when you have nearly infinite resources? For DeepSeek, their constraints are one of the reasons they achieved a breakthrough. One of their contributions, FP8 training, was finding a way to train models on GPUs whose FP32 performance is limited due to export controls.

[1]: https://www.investopedia.com/terms/r/resource-curse.asp#:~:t...

SamPatt · 6 months ago
R1 came out when Grok 3's training was still ongoing. They shared their techniques freely, so you would expect the next round of models to incorporate as many of those techniques as possible. The bump you would get from the extra compute occurs in the next cycle.

If Musk really can get 1 million GPUs and they incorporate some algorithmic improvements, it'll be exciting to see what comes out.

dogma1138 · 6 months ago
Deepseek didn’t seem to invest in talent as much as it did in smuggling restricted GPUs into China via 3rd countries.

Also, not for nothing, scaling compute 100x or even 1000x is much easier than scaling talent 10x or even 2x, since you don't need workers, you need discovery.

tw1984 · 6 months ago
Talent is not something you can just freely pick up from your local Walmart.
wordofx · 6 months ago
DeepSeek was a crypto mining operation before they pivoted to AI. They have an insane amount of GPUs lying around. So we have no idea how much compute they have compared to xAI.
oskarkk · 6 months ago
Do you have any sources for that? When I searched "DeepSeek crypto mining" the first result was your comment, the other results were just about the wide tech market selloff after DeepSeek appeared (that also affected crypto). As far as I know, they had many GPUs because their parent company was using AI algorithms for trading for many years.

https://en.wikipedia.org/wiki/High-Flyer

miki123211 · 6 months ago
Crypto GPUs have nothing to do with AI GPUs.

Crypto mining is an embarrassingly parallel problem, requiring little to no communication between GPUs. To a first approximation, in crypto, 10x-ing the number of "cores" per GPU, 10x-ing the number of GPUs per rig, and 10x-ing the number of rigs you own are basically equivalent. An infinite number of extremely slow GPUs would do just as well as one infinitely fast GPU. This is why consumer GPUs are great for crypto.

AI is the opposite. In AI, you need extremely fast communication between GPUs. This means getting as much memory per GPU as possible (to make communication less necessary), and putting all the GPUs in one datacenter.

Consumer GPUs, which were used for crypto, don't support the fast interconnect technologies needed for AI training, and they don't come in the 80 GB memory versions that AI labs need. This is Nvidia's price differentiation strategy.
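
As a rough back-of-envelope illustration (the model size, precision, and link speeds below are assumptions picked for round numbers, not measurements of any real cluster):

    # Data-parallel training: every optimizer step, each GPU exchanges a full
    # set of gradients; a ring all-reduce moves roughly 2x the gradient bytes.
    params = 70e9              # assume a 70B-parameter model
    bytes_per_grad = 2         # bf16 gradients
    traffic_per_step = 2 * params * bytes_per_grad   # ~280 GB per GPU per step

    nvlink = 900e9             # ~900 GB/s NVLink-class links inside a node
    commodity = 10e9           # ~10 GB/s, typical of an ex-mining rig's networking

    print(traffic_per_step / nvlink)     # ~0.3 s of communication per step
    print(traffic_per_step / commodity)  # ~28 s per step: communication dominates

    # Crypto mining needs none of this: each GPU grinds hashes independently,
    # so slow links between cheap consumer cards cost essentially nothing.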

oskarkk · 6 months ago
> While xAI's model isn't a substantial leap beyond Deepseek R1, it utilizes 100 times more compute.

I'm not sure if it's close to 100x more. xAI had 100K Nvidia H100s, while this is what SemiAnalysis writes about DeepSeek:

> We believe they have access to around 50,000 Hopper GPUs, which is not the same as 50,000 H100, as some have claimed. There are different variations of the H100 that Nvidia made in compliance to different regulations (H800, H20), with only the H20 being currently available to Chinese model providers today. Note that H800s have the same computational power as H100s, but lower network bandwidth.

> We believe DeepSeek has access to around 10,000 of these H800s and about 10,000 H100s. Furthermore they have orders for many more H20’s, with Nvidia having produced over 1 million of the China specific GPU in the last 9 months. These GPUs are shared between High-Flyer and DeepSeek and geographically distributed to an extent. They are used for trading, inference, training, and research. For more specific detailed analysis, please refer to our Accelerator Model.

> Our analysis shows that the total server CapEx for DeepSeek is ~$1.6B, with a considerable cost of $944M associated with operating such clusters. Similarly, all AI Labs and Hyperscalers have many more GPUs for various tasks including research and training than they commit to an individual training run due to centralization of resources being a challenge. X.AI is unique as an AI lab with all their GPUs in 1 location.

https://semianalysis.com/2025/01/31/deepseek-debates/

I don't know how much slower the GPUs they have are, but if they have 50K of them, that doesn't sound like 100x less compute to me. Also, a company that has N GPUs and trains AI on them for 2 months can achieve the same results as a company that has 2N GPUs and trains for 1 month. So DeepSeek could spend a longer time training to offset the fact that they have fewer GPUs than competitors.
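
Rough GPU-hours arithmetic to illustrate that last point (the training durations here are made-up assumptions, and this ignores real per-GPU and interconnect differences):

    xai      = 100_000 * 1 * 30 * 24   # 100K H100s training for ~1 month (assumed)
    deepseek =  50_000 * 2 * 30 * 24   # 50K Hopper GPUs training for ~2 months (assumed)
    print(xai / deepseek)              # 1.0 -- the same GPU-hours, nowhere near a 100x gap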

cma · 6 months ago
Having 50K of them isn't the same thing as 50K in one high-bandwidth cluster, right? xAI has all theirs so far in one connected cluster, all homogeneous H100s, right?
_giorgio_ · 6 months ago
Deepseek spent at least 1.5 billion on hardware.
rfoo · 6 months ago
I'm pretty skeptical of that 75% on GPQA Diamond for a non-reasoning model. Hope that xAI can make Grok 3 API available next week so I can run it against some private evaluations to see if it's really this good.

Another nitpick: I don't think DeepSeek had 50k Hopper GPUs. Maybe they have 50k now, after getting the world's attention and having a nationally sponsored grey market backing them, but that 50k number is certainly dreamed up. During the past year DeepSeek's intern recruitment ads always just mentioned "unlimited access to 10k A100s", suggesting that they may have very limited H100/H800s, and most of their research ideas were validated on smaller models on an Ampere cluster. The 10k A100 number matches a cluster their parent hedge fund company announced a few years ago. All in all, my estimate is that they had more (maybe 20k) A100s, and single-digit thousands of H800s.

kgwgk · 6 months ago
> my estimation is they had more (maybe 20k) A100s, and single-digit thousands of H800s.

Their technical report on DeepSeek-V3 says that it "is trained on a cluster equipped with 2048 NVIDIA H800 GPUs." If they had even high-single-digit thousands of H800s they would have probably used more computing power instead of waiting almost two months.

riku_iki · 6 months ago
> I'm pretty skeptical of that 75% on GPQA Diamond for a non-reasoning model.

Could that benchmark simply have leaked into the training data, as many others have?

petesergeant · 6 months ago
The author has come up with their own, _unusual_ definition of the bitter lesson. In fact, as a direct quote, the original is:

> Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation.

e.g.: “the study of linguistics doesn’t help you build an LLM” or “you don’t need to know about chicken physiology to make a vision system that tells you how old a chicken is”

The author then uses a narrow and _unusual_ definition of what computation _means_, by saying it simply means access to fast chips, rather than the work you can perform on them, which would obviously include how efficiently you use them.

In short, this article misuses two terms to more simply say “looks like the scaling laws still work”.

viraptor · 6 months ago
This is a weird takeaway from the recent changes. Right now companies can scale because there's a stupid amount of stupid money flowing into the AI craze, but that's going to end. Companies are already discovering the issues with monetising those systems. Sure, they can "let go" and burn the available cash, but the investors will eventually come knocking. Since everyone figures out similar tech anyway, it's the people with the most experience improving the tech who will be in the best position long term, while OpenAI will be stuck trying to squeeze adverts and monitoring into their chat for cash flow.
az226 · 6 months ago
Until we see progress slowing down, I don’t see venture capital disappearing in the race to ASI and beyond.
podgorniy · 6 months ago
Adjustment: not progress, the hype.

People believe that LLM progress will become the foundation of future economic expansion the same way microelectronics did. But for now there are few signs of that economic benefit from AI/LLM stuff. If one does the math on what productivity increase the tech would have to deliver in order to have a positive ROI, one would be surprised how far reality is from making the investments feasible: https://www.bruegel.org/working-paper/tension-between-explod.... Yes, anecdotally people tell stories of how they can code twice/thrice/ten times faster, or how they automated their whole workflow or replaced support with an LLM. But that's far from enough to make AI investment feasible in existing businesses (AI startups will flourish for a while on venture money). Also anecdotally, there are many failed attempts to replace people with LLMs (like McDonald's ordering, which resulted in crazy orders).

So what we have is hype on top of a belief in progress as a continuous phenomenon. But progress itself has slowed greatly. Where are all the breakthroughs that change our way of living? Pocket computers and consumer electronics (which are not a discovery so much as an optimisation) and the internet (also more about scaling than inventing) were the last. 3D printing, cancer treatment, and robotics were thought to be the next factors. Until AI/LLMs. Now AIs/LLMs are the last resort for believers in progress and techno-optimists like Musk.

nickfromseattle · 6 months ago
Side question, let's say Grok is comparable in intelligence to other leading models. Will any serious business switch their default AI capabilities to Grok?
taf2 · 6 months ago
We might. I was testing it out on some Salesforce Apex code and it was doing a better job than o3-mini-high at getting the job done for me…
resfirestar · 6 months ago
Grok 2's per-token prices are similar to GPT-4o, but since Grok tends to write longer responses than others, it can be significantly more expensive to use depending on the task. If xAI prices Grok 3 to compete with o1, not everyone is going to be lining up to use it even if it's a bit better than the competition. If that's how it goes, I'll be interested in the pricing for Grok 3 mini.
marcuschong · 6 months ago
I wouldn't, for two reasons:

- We've already tested OpenAI's GPT and Gemini a lot. Although they're not deterministic, we have used them enough to trust them and know the pitfalls.

- Elon's example of Grok outputting 'X is the only place for real information' makes the model almost completely useless for text generation. Even more so than DeepSeek.
gusfoo · 6 months ago
> Side question, let's say Grok is comparable in intelligence to other leading models. Will any serious business switch their default AI capabilities to Grok?

Yes, I'd say so. Bear in mind that, outside of the Terminally Online, very few people would deliberately hobble their business by choosing an inferior product.

nickthegreek · 6 months ago
If the API is cheap enough.
ssijak · 6 months ago
Why not?
chank · 6 months ago
Because they already have something that works. Why switch if there's no advantage?
cowpig · 6 months ago
Because people want to protect their businesses?

I don't even think it makes sense to use the closed, API-access models for core business functions. Seems like an existential risk to be handing over that kind of information to a third party in the age of AI.

But handing it over to the guy who is simultaneously running for-profit companies that contract with the government and seizing control of the Treasury? To a guy who has a long history of SEC issues, and who was caught cheating at a video game just to convince everyone he's a genius? That just seems incredibly short-sighted.

GaggiX · 6 months ago
The bitter lesson is about the fact that general methods that leverage computation are ultimately the most effective. Grok 3 is not more general than DeepSeek's or OpenAI's models, so mentioning the bitter lesson here doesn't make much sense; it's just the scaling law.