Seems like it is indeed the new SOTA model, with significantly better scores than o3, Gemini, and Claude in Humanity's Last Exam, GPQA, AIME25, HMMT25, USAMO 2025, LiveCodeBench, and ARC-AGI 1 and 2.
Specialized coding model coming "in a few weeks". I notice they didn't talk about coding performance very much today.
Agreed. I noticed a quick flyby of a bad “reasoning smell” in the baseball World Series simulation, though - it looks like it pulled some numbers from Polymarket, reasoned for a long time, and then came back with the Polymarket number for the Dodgers but presented it as its own. It was a really fast run-through, so I may be wrong, but it reminds me that it’s useful to have skeptics on the safety teams of these frontier models.
That said, these are HUGE improvements. Providing we don’t have benchmark contamination, this should be a very popular daily driver.
On coding - 256k context is the only real bit of bad news. I would guess their v7 model will have longer context, especially if it’s better at video. Either way, I’m looking forward to trying it.
Either they overtook other LLMs by simply using more compute (which is reasonable to think as they have a lot of GPUs) or I'm willing to bet there is benchmark contamination. I don't think their engineering team came up with any better techniques than used in training other LLMs, and Elon has a history of making deceptive announcements.
Even if one does not have a positive view of Elon Musk, Grok catching up to the big three (Google, OpenAI, Anthropic) is incredible. They are now at approximately the same level.
> Seems like it is indeed the new SOTA model, with significantly better scores than o3
It has been demonstrated for quite some time that censoring models results in drastically reduced scores. Sure, maybe prevent it from telling someone how to build a bomb, but we've seen Grok 3 routinely side with progressive views despite having access to the worst of humanity (and its sponsor).
The trick they announce for Grok Heavy is running multiple agents in parallel and then having them compare results at the end, with impressive benchmarks across the board. This is a neat idea! Expensive and slow, but it tracks as a logical step. Should work for general agent design, too. I'm genuinely looking forward to trying this out.
EDIT: They're announcing big jumps in a lot of benchmarks. TIL they have an API one could use to check this out, but it seems like xAI really has something here.
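For anyone trying to picture the trick, here is a minimal sketch of the parallel-drafts-plus-judge pattern, assuming an OpenAI-compatible client pointed at xAI's API (the "grok-4" model id and the orchestration details are my guesses; xAI hasn't published how Grok Heavy actually coordinates its agents):

    import asyncio
    from openai import AsyncOpenAI

    # xAI's API is OpenAI-compatible; the model id below is an assumption.
    client = AsyncOpenAI(base_url="https://api.x.ai/v1", api_key="XAI_API_KEY")

    async def ask(prompt: str) -> str:
        resp = await client.chat.completions.create(
            model="grok-4",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    async def heavy(prompt: str, n: int = 4) -> str:
        # Run n independent attempts in parallel...
        drafts = await asyncio.gather(*(ask(prompt) for _ in range(n)))
        # ...then one more call compares the drafts and synthesizes an answer.
        numbered = "\n\n".join(f"Draft {i + 1}:\n{d}" for i, d in enumerate(drafts))
        return await ask(
            f"Question: {prompt}\n\nHere are {n} independent answers:\n\n"
            f"{numbered}\n\nCompare them and produce the single best answer."
        )

Expensive (n+1 model calls per query) and slow, exactly as described, but trivially parallel.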
I can understand how/that this works, but it still feels like a 'hack' to me. It feels like the LLMs themselves are plateauing, while the applications get better by running the LLMs deeper, longer, and wider (and by adding 'non-AI' tooling/logic at the edges).
But maybe that's simply the solution, like the solution to original neural nets was (perhaps too simply put) to wait for exponentially better/faster hardware.
This is exactly how human society scaled from the cavemen era to today. We didn't need to make our brains bigger in order to get to the modern industrial age - increasingly sophisticated tool use and organization was all we did.
It only mattered that human brains are just big enough to enable tool use and organization. It ceased to matter once our brains passed a certain threshold. I believe LLMs are past this threshold as well (they haven't 100% matched the human brain and maybe never will, but that doesn't matter).
An individual LLM call might lack domain knowledge, context and might hallucinate. The solution is not to scale the individual LLM and hope the problems are solved, but to direct your query to a team of LLMs each playing a different role: planner, designer, coder, reviewer, customer rep, ... each working with their unique perspective & context.
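As a toy illustration of that role-splitting idea, reusing the async ask() helper from the sketch above (the roles and prompts here are made up, not any product's actual design):

    # Each role sees the previous role's output and adds its own perspective.
    ROLES = [
        ("planner", "Break the task into a short, concrete plan."),
        ("coder", "Write code implementing the plan."),
        ("reviewer", "Review the code for bugs and suggest fixes."),
    ]

    async def team(task: str) -> str:
        context = f"Task: {task}"
        for name, instruction in ROLES:
            context = await ask(f"You are the {name}. {instruction}\n\n{context}")
        return context  # the reviewer's final output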
I get that feeling too - the underlying tech has plateaued, but now they're brute-force trading extra time and compute for better results. I don't know if that scales anything but, at best, linearly. Are we going to end up with 10,000 AI monkeys at 10,000 AI typewriters and a team of a dozen monkeys deciding which one's work they like the most?
Isn't that kinda why we have collaboration and get in a room with colleagues to discuss ideas? I.e., thinking about different ideas, getting different perspectives, considering trade-offs in various approaches, etc. results in a better solution than just letting one person go off and try to solve it with their thoughts alone.
Not sure if that's a good parallel, but seems plausible.
I can’t help but call out that o1-pro was great: it rarely took more than five minutes, and I was almost never dissatisfied with the results given the wait. I happily paid for o1-pro the entire time it was available. Now, o3-pro is a relative disaster, often taking over 20 minutes just to refuse to follow directions, gaslight people about files being available for download that don’t exist, or provide simplified answers. It’s worse than useless when it actively wastes users’ time. I don’t see myself ever trusting OpenAI again after this “pro” subscription fiasco. To go from a great model to just taking it away and forcing an objectively terrible replacement is definitely going the wrong way, when everyone else is improving (Gemini 2.5, Claude Code with Opus, etc). I can’t believe Meta would pay a premium to poach the OpenAI people responsible for this severe regression.
I mean, either they cheated on evals à la Llama 4, or they have a paradigm that's currently best in class in at least a few standard evals. Both alternatives are possible, I suppose.
So the progress is basically to brute force even more?
We got from "single prompt, single output", to reasoning (simple brute-forcing) and now to multiple parallel instances of reasoning (distributed brute-forcing)?
No wonder the prices are increasing and capacity is more limited.
I just tried Grok 4 and it's insanely good. I was able to generate 1,000 lines of Java CDK code responsible for setting up an EC2 instance with certain pre-installed software. Grok produced all the code in one iteration. 1,000 lines of code, including VPC, Security Groups, etc. Zero syntax errors! Most importantly, it generated userData (#!/bin/bash commands) with accurate `wget` pointing to valid URLs of the latest software artifacts on GitHub. Insane!
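For readers who don't use CDK, the shape of what's being described looks roughly like this - a compressed sketch in Python CDK rather than the Java the commenter used, with a placeholder software URL and arbitrary instance sizing (this is not what Grok generated):

    from aws_cdk import App, Stack
    from aws_cdk import aws_ec2 as ec2
    from constructs import Construct

    class DemoStack(Stack):
        def __init__(self, scope: Construct, id: str) -> None:
            super().__init__(scope, id)
            vpc = ec2.Vpc(self, "Vpc", max_azs=2)
            sg = ec2.SecurityGroup(self, "Sg", vpc=vpc, allow_all_outbound=True)
            sg.add_ingress_rule(ec2.Peer.any_ipv4(), ec2.Port.tcp(22), "ssh")
            user_data = ec2.UserData.for_linux()  # becomes the #!/bin/bash script
            user_data.add_commands(
                "wget https://example.com/artifact.tar.gz",  # placeholder URL
                "tar -xzf artifact.tar.gz",
            )
            ec2.Instance(
                self, "Instance",
                vpc=vpc,
                security_group=sg,
                instance_type=ec2.InstanceType.of(
                    ec2.InstanceClass.T3, ec2.InstanceSize.MICRO
                ),
                machine_image=ec2.MachineImage.latest_amazon_linux2(),
                user_data=user_data,
            )

    app = App()
    DemoStack(app, "demo")
    app.synth()

The real output would obviously be far longer; the point is that every construct, argument, and URL has to line up exactly for 1,000 lines to synth with zero errors.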
The problem is that code like this is excellent as a one-off, but as a maintainable piece of code that needs to be in source control, shared across teams, follow a standard SDLC, be immutable, and track changes in some state - it's just not there.
If an intern handed me code like this to deploy an EC2 instance in production, I would need to have a long discussion about their decisions.
Please share your result if possible. So many lines in a single shot with no errors would indeed be impressive. Does Grok run tools for these sorts of queries (linters/sandbox execution/web search)?
The "heavy" model is $300/month. These prices seem to keep increasing while we were promised they'll keep decreasing. It feels like a lot of these companies do not have enough GPUs which is a problem Google likely does not have.
I can already use Gemini 2.5 Pro for free in AI studio. Crazier still, I can even set the thinking budget to a whopping 32k and still not pay a dime. Maybe Gemini 3.0 will be available for free as well.
Who promised that there would be no advanced models with high costs?
Prices for the same number of tokens at the same level of capability are falling. But just like Moore's law: it most certainly did NOT say that chips would get no more complex than the 1103 1Kb DRAM, only that the same capability would shrink from 10mm^2 to a speck far too small to see.
If it could write like George Will or Thomas Sowell or Friedrich Hayek or even William Loeb that would be one thing. But it hears dog whistles and barks, which makes it a dog. Except a real dog is soft and has warm breath, knows your scent, is genuinely happy when you come home, and will take a chomp out of the leg of anyone who invades your home at night.
We are also getting this kind of discussion
https://news.ycombinator.com/item?id=44502981
where Grok exhibited the kind of behavior that puts "degenerate" in "degenerate behavior". Why do people expect anything more? Ten years ago you could be a conservative with a conscience -- now if you are, you start The Bulwark.
> The most expensive computer is a lot more expensive than the first PC.
Not if you're only looking at modern PCs (and adjusting for inflation). It seems unfair to compare a computer built for a data center, with tens of thousands of dollars in GPUs, to a PC from back then, as opposed to a mainframe.
> The most expensive computer is a lot more expensive than the first PC.
Depends on your definition of "computer". If you mean the most expensive modern PC I think you're way off. From https://en.wikipedia.org/wiki/Xerox_Alto: "The Xerox Alto [...] is considered one of the first workstations or personal computers", "Introductory price US$32,000 (equivalent to $139,000 in 2024)".
The base model Apple II cost ~$1300USD when it was released; that's ~$7000USD today inflation adjusted.
In other words, Apple sells one base-model computer today that is more expensive than the Apple II; the Mac Pro. They sell a dozen other computers that are significantly cheaper.
That was the most predictable outcome. It's like we learned nothing from Netflix, nor from the general enshittification of tech by the end of the 2010s. We'll have the billionaire AI tech capture markets and charge enterprise prices to pay back investors. Then maybe we'll have a few free/cheap models fighting over the scraps.
Those small creators hoping to leverage AI to bring their visions to life for less than their grocery bill will have a rude awakening. That's why I never liked the argument of "but it saves me money on hiring real people".
I heard some small Chinese shops making mobile games were already having this problem in recent years, and had to re-hire their human labor when costs started rising.
I'm honestly impressed that the Sutro team could write a whole post complaining about Flash and not once mention that Flash was actually 2 different models, and even go further to compare the price of Flash non-thinking to Flash Thinking. The team is either scarily incompetent or purposely misleading.
Google replaced Flash non-thinking with Flash-Lite. It rebalanced the cost of Flash Thinking.
Also, their API pricing is a little misleading: it matches Sonnet 4 pricing ($3/$15) only "for requests under 128k" (whatever that means), but above that it's 2x more.
That 128k is a reference to the context window: how many tokens you put in. Presumably Grok 4 with a 128k context window runs on less hardware (it needs much less RAM than 256k) and they route requests accordingly internally.
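In other words, the bill would be computed something like this (illustrative only: the $3/$15 figures come from the comment above, and the 2x multiplier past 128k input tokens is as described there, not an official price sheet):

    def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
        base_in, base_out = 3.00, 15.00  # $ per million tokens, per the thread
        mult = 2.0 if input_tokens > 128_000 else 1.0
        return (input_tokens * base_in + output_tokens * base_out) * mult / 1e6

    print(request_cost_usd(100_000, 2_000))  # under the 128k threshold: $0.33
    print(request_cost_usd(200_000, 2_000))  # over it, billed at 2x: $1.26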
Why is the number of GPUs the problem, and not the amount of GPU usage? I don't think buying GPUs is the problem; running tons of GPUs is what can get very expensive. I presume that's the reason it's so expensive, especially with LLMs.
> These prices seem to keep increasing while we were promised they'd keep decreasing
I don't remember anyone promising that, but whoever promised you that frontier public model pricing would be monotonically decreasing, over some period of time that includes our current present, was either lying or badly misguided. While there will be short-term deviations, the overall arc for that will continue to be upward.
OTOH, the models available at any given price point will also radically improve, to the point where you can follow a curve of both increasing quality and decreasing price, so long as you don't want a model at the quality frontier.
o3 was just reduced in price by 80%. Grok 4 is a pretty good deal for having just been released and being so much better. The token price is the same as Grok 3 for the non-heavy model. Google is losing money to try and gain relevance. I guess I'm not sure what your point is?
While Google is explicit about that, I have good reason to believe this actually happens in most if not all massive LLM services. I think Google's free offerings are more about vendor lock-in, a common Google tactic.
Not a junior engineer in a developed country, but what was previously an offshore junior engineer tasked with doing the repetitive labor too costly for western labor.
Grok's updated voice mode is indeed impressive. I wish there was a way to disable automatic turn detection, so that it wouldn't treat silence as an end of the response. I like Claude's approach (you need to tap in order to end the response), but it's not very reliable because sometimes it just abruptly cuts my response without waiting until I tap.
I was pleasantly surprised that Grok even supports (to some degree) Lithuanian in voice mode, which is quite a niche language. Grok's responses themselves are alright, but ChatGPT and Gemini way surpass it in speech recognition and speech synthesis.
> Grok's updated voice mode is indeed impressive. I wish there was a way to disable automatic turn detection, so that it wouldn't treat silence as an end of the response.
You can circumvent that by instructing the model to use "radio etiquette" - only respond after the other part says "over". It will still be compelled to answer when it detects silence, you can't prevent that, but you can instruct it to only reply with a short "mhm" until you say "over". Feels very natural.
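One way to phrase it, as a sketch (the exact wording is my assumption, not anything official):

    # A hypothetical "radio etiquette" system prompt / custom instruction.
    RADIO_PROMPT = (
        "We are speaking over a radio link. Do not assume I have finished "
        "until I say 'over'. If you hear silence before I say 'over', reply "
        "only with a short 'mhm'. Give your full answer only after I say "
        "'over', and end your own transmissions with 'over'."
    )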
Like most models I've used with this old hack, it will immediately start role-playing and also end its own responses with "over".
This is such a cool idea. I wonder whether it's possible to define a custom Personality in Grok's voice settings that would do this. Unfortunately I'm not able to create a new Personality in Grok's settings to test this right now on my phone (iPhone 15 Pro Max), because the Personality creation screen closes immediately after opening it. Might be a bug or some other issue.
Yes, their voice mode is pretty good; it also works with Polish (much better than a few months ago). I wish they also had a 'push to talk' option (walkie-talkie style with a big button) like Perplexity allows, alongside 'automatic'.
It would also be great if they added voice mode in the browser (again, like Perplexity).
> It would also be great if they added voice mode in the browser
There seems to be a voice mode button in the prompt input box at ~29:00 of the Grok 4 announcement video.
So perhaps they're working on this, but it's hidden from the public.
I find that for auto turn detection, models work better if you put in the system prompt "if it seems the user hasn't completed their thought yet, output silence". This hack works around their compulsive need to output something.
I feel like they should train a dumb model that does nothing but recognize when someone has finished talking, and use that to determine when to stop listening and start responding. Maybe it could even run on the phone?
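The dumb baseline already exists and does run on-device: plain voice activity detection. Here's a sketch using the webrtcvad library; a learned endpointer would also look at prosody and semantics, but treating ~1 second of trailing silence as end-of-turn is the crude version (frame sizes and thresholds below are arbitrary choices):

    import webrtcvad

    vad = webrtcvad.Vad(2)        # aggressiveness 0-3
    SAMPLE_RATE = 16000           # webrtcvad supports 8/16/32/48 kHz
    END_SILENCE_FRAMES = 40       # 40 x 30 ms frames = ~1.2 s of silence

    def turn_ended(frames: list[bytes]) -> bool:
        """frames: trailing 30 ms chunks of 16-bit mono PCM from the mic."""
        recent = frames[-END_SILENCE_FRAMES:]
        if len(recent) < END_SILENCE_FRAMES:
            return False
        return not any(vad.is_speech(f, SAMPLE_RATE) for f in recent)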
Lithuanian sounds so weird on ChatGPT though, almost like my kids speak - with a sort of English accent. Regardless, it gives my parents a superpower (when it actually works, hehe).
Grok's Twitter integration has legitimately been one of the best use cases I've seen. Just being able to ask Grok right within the tweet about context or meaning of any jargon is very useful.
I think the Grok button that is present on tweets is the best way to ask Grok about tweets. Tagging @grok just spams others' timelines with useless AI responses. The Grok button lets you keep it private.
It's an agentic research mode, grounded with links from the web that reduce or eliminate hallucinations. The result is a very detailed, sometimes 50-page output. Very useful if you're trying to understand a new industry, the state of the art on a tech, etc.
Out of interest, has anyone ever integrated with Grok? I've done so many LLM integrations in the last few years, but never heard of anyone choosing Grok. I feel like they are going to need an unmistakably capable model before anyone would want to risk it - they don't behave like a serious company.
Grok 3 is on Azure AI Foundry [0] and announced an integration with Telegram, albeit they are paying Telegram $300m, not vice versa [1]. But I agree, choosing Grok is just a huge reputational liability for anyone’s work that is serious.
[0] https://devblogs.microsoft.com/foundry/announcing-grok-3-and... [1] https://www.bbc.co.uk/news/articles/cdxvr3n7wlxo
Any plans for GCP Vertex AI or AWS Bedrock? Apparently Grok 3 had the highest score for Golang on roocode.com/evals, so I’d like to try it for coding. The free-tier app hasn’t been bad either; I like its attitude a bit better than ChatGPT’s.
There is so much money and so many top labs falling over themselves to attract good talent, that at this point people have to be leaning on ideological goals to choose their employer.
Are there really that many AI researchers who want to make Elon god-emperor?
I read the last election and other signals as indicating that there's way more unspoken diversity of thought in people's minds than what people feel safe to say. Secretly, lots of top talent probably doesn't care, or even aligns with Elon, but chooses to say so at most with their actions, in the form of being OK working for him.
A lot of serious engineers would love to work in an environment that isn't the HR-reigning office politics bullshit standard of the past decade or two.
I don't even really like Elon but I bet the engineers at X are having a better time in their day-to-day than the ones at Meta or Google where all their work is constantly roadblocked by red tape, in-fighting, and PMs whose only goal is to make it look like they headed something important to get themselves promoted. Elon's at least got a vision and keeps it a top priority to be competitive in the AI space.
I also feel Elon's team has been "untouchable" for Zuck, who doesn't want to stir anything up with him. But since Elon's fall from grace with the admin, that could change?
If you're focusing on ideology, it isn't like the other companies are all that good. With Sam Altman you're still working for a pathological liar with delusions of grandeur. With Google and Meta you're propping up a massive worldwide surveillance apparatus.
Tech-bros have been propping up agents/propagators of some of the biggest social ills of the past ~2 decades, xAI isn't all that different.
You would have to be insane to integrate the model that last week called itself "Mecha Hitler" into your live product.
As a huge Musk fan, I'll be the first to point out that he's doing exactly what he accused sama of doing: making powerful AI with an obvious lack of control or effective alignment.
Pointy sticks and ASML's EUV machines were designed by roughly the same lumps of compute-fat :)
Yes, but... in order to train your next SOTA model you have to do this anyway, and do rejection sampling to generate good synthetic data.
So if you can do it in prod for users paying $300/month, it's a pretty good deal.
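For anyone unfamiliar, rejection sampling here just means: sample many candidates, keep the top-scored ones as training data. A sketch, where sample() and score() stand in for a generator model and a reward model/verifier, and the counts are arbitrary:

    import heapq

    def rejection_sample(prompt, sample, score, k=16, keep=2):
        # Draw k candidate answers from the generator...
        candidates = [sample(prompt) for _ in range(k)]
        # ...keep only the best-scoring ones as (prompt, answer) pairs for SFT.
        best = heapq.nlargest(keep, candidates, key=lambda a: score(prompt, a))
        return [(prompt, a) for a in best]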
https://x.com/karpathy/status/1870692546969735361
https://github.com/irthomasthomas/llm-consortium
Myself, I'm looking forward to trying it out when companies with less, um, baggage implement the same. (I have principles I try to maintain.)
Impressive. /s
How do you know the criteria you mention haven't (or can't) been factored into prompt and context tuning?
How do you know that all the criteria that were important in the pre-LLM world still carry the same priority as LLM capabilities increase?
"This nearly doubles the previous commercial SOTA and tops the current Kaggle competition SOTA."
https://x.com/arcprize/status/1943168950763950555
A Ferrari is more expensive than the Model T.
The most expensive computer is a lot more expensive than the first PC.
The price that usually falls is:
* The entry level.
* The same performance over time.
But the _price range_ gets wider. That's fine. That's a sign of maturity.
The only difference this time is that the entry level was artificially 0 (or very low) because of VC funding.
https://news.ycombinator.com/item?id=44457371
The vast majority of the world can’t afford 100s of dollars a month
Well, valuations keep increasing, they have to make the calculations work somehow.
Aren't they all still losing money, regardless?
Like the other AI companies, they will want to sign up enterprises.
It is Google. So, I'd pay attention to data collection feeding back into training or evaluation.
https://news.ycombinator.com/item?id=44379036
I hope that can be turned off while driving...
Hope FB brings something like this though. It might be especially useful to summarize/search big groups.
People used to cry about how private groups and Slack killed forums and hid information, but I think we have a chance with tools like this.
The only two areas I've found Grok to be the best at are real time updates and IT support questions.
Can you say what you mean by deep research?
https://x.ai/news/grok-3#grok-agents-combining-reasoning-and...