Readit News
modeless · a month ago
Seems like it is indeed the new SOTA model, with significantly better scores than o3, Gemini, and Claude in Humanity's Last Exam, GPQA, AIME25, HMMT25, USAMO 2025, LiveCodeBench, and ARC-AGI 1 and 2.

Specialized coding model coming "in a few weeks". I notice they didn't talk about coding performance very much today.

vessenes · a month ago
Agreed. I noticed a quick flyby of a bad “reasoning smell” in the baseball World Series simulation, though - it looks like it pulled some numbers from polymarket, reasoned a long time, and then came back with the polymarket number for the Dodgers but presented as its own. It was a really fast run through, so I may be wrong, but it reminds me that it’s useful to have skeptics on the safety teams of these frontier models.

That said, these are HUGE improvements. Providing we don’t have benchmark contamination, this should be a very popular daily driver.

On coding - 256k context is the only real bit of bad news. I would guess their v7 model will have longer context, especially if it’s better at video. Either way, I’m looking forward to trying it.

dbagr · a month ago
Either they overtook other LLMs by simply using more compute (which is reasonable to think as they have a lot of GPUs) or I'm willing to bet there is benchmark contamination. I don't think their engineering team came up with any better techniques than used in training other LLMs, and Elon has a history of making deceptive announcements.
esafak · a month ago
I wish the coding models were available in coding agents. Haven't seen them anywhere.
vincent_s · a month ago
Grok 4 is now available in Cursor.
justarobert · a month ago
Plenty like Aider and Cline can connect to pretty much any model with an API.
Squarex · a month ago
Even if one does not have a positive view of Elon Musk, Grok's catching up to the big three (Google, OpenAI, Anthropic) is incredible. They are now at approximately the same level.

Dead Comment

zamalek · a month ago
> Seems like it is indeed the new SOTA model, with significantly better scores than o3

It has been demonstrated for quite some time that censoring models results in drastically reduced scores. Sure, maybe prevent it from telling someone how to build a bomb, but we've seen Grok 3 routinely side with progressive views despite having access to the worst of humanity (and its sponsor).

fdsjgfklsfd · a month ago
Wait, are you implying that Grok 3 is "censored" because it aligns with "progressive" views?
tibbar · a month ago
The trick they announce for Grok Heavy is running multiple agents in parallel and then having them compare results at the end, with impressive benchmarks across the board. This is a neat idea! Expensive and slow, but it tracks as a logical step. Should work for general agent design, too. I'm genuinely looking forward to trying this out.

EDIT: They're announcing big jumps in a lot of benchmarks. TIL they have an API one could use to check this out; it seems like xAI really has something here.
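The announced mechanism (parallel agents that compare answers at the end) can be sketched generically. Everything below is an assumption for illustration: `call_model` is a stub for a real LLM API call, and a simple majority vote stands in for the agents' comparison step (a judge-model call over the drafts is another option):

```python
import concurrent.futures
from collections import Counter

def call_model(prompt: str, seed: int) -> str:
    """Stub for an LLM API call. A real implementation would hit a
    chat-completions endpoint with temperature > 0 so drafts differ."""
    return ["42", "42", "41", "42"][seed % 4]  # canned drafts for the sketch

def heavy_answer(prompt: str, n_agents: int = 4) -> str:
    # Fan out: run several agents on the same prompt in parallel.
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_agents) as pool:
        drafts = list(pool.map(lambda s: call_model(prompt, s), range(n_agents)))
    # Compare: majority vote stands in for the "agents compare results" step.
    best, _count = Counter(drafts).most_common(1)[0]
    return best

print(heavy_answer("What is 6 * 7?"))  # -> 42
```

The cost is roughly n_agents times a single run, which matches the "expensive and slow" observation.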

icoder · a month ago
I can understand how/that this works, but it still feels like a 'hack' to me. It still feels like the LLMs themselves are plateauing but the applications get better by running the LLMs deeper, longer, wider (and by adding 'non-AI' tooling/logic at the edges).

But maybe that's simply the solution, like the solution to original neural nets was (perhaps too simply put) to wait for exponentially better/faster hardware.

crazylogger · a month ago
This is exactly how human society scaled from the cavemen era to today. We didn't need to make our brains bigger in order to get to the modern industrial age - increasingly sophisticated tool use and organization was all we did.

It only mattered that human brains are just big enough to enable tool use and organization. It ceased to matter once our brains passed a certain threshold. I believe LLMs are past this threshold as well (they haven't 100% matched the human brain and maybe never will, but that doesn't matter).

An individual LLM call might lack domain knowledge, context and might hallucinate. The solution is not to scale the individual LLM and hope the problems are solved, but to direct your query to a team of LLMs each playing a different role: planner, designer, coder, reviewer, customer rep, ... each working with their unique perspective & context.
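The role-based pipeline described above can be sketched as a chain of calls where each stage consumes only the previous stage's artifact. `call_model`, the role names, and the string outputs are all illustrative stubs, not a real API:

```python
def call_model(role: str, upstream: str) -> str:
    """Stub for an LLM call with a role-specific system prompt; a real
    version would send `upstream` as the context for this stage."""
    return f"{role}({upstream})"

# Hypothetical role chain; each stage sees only the previous stage's
# output, giving it a narrow, focused context instead of one giant prompt.
ROLES = ["planner", "designer", "coder", "reviewer"]

def run_team(task: str) -> str:
    artifact = task
    for role in ROLES:
        artifact = call_model(role, artifact)
    return artifact

print(run_team("add login page"))
# -> reviewer(coder(designer(planner(add login page))))
```

The nesting in the output makes the handoff visible: the reviewer never sees the raw task, only what the coder produced.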

SketchySeaBeast · a month ago
I get that feeling too - the underlying tech has plateaued, but now they're brute-force trading extra time and compute for better results. I don't know if that scales anything better than, at best, linearly. Are we going to end up with 10,000 AI monkeys on 10,000 AI typewriters and a team of a dozen monkeys deciding which one's work they like the most?
the8472 · a month ago
grug think man-think also plateau, but get better with tool and more tribework

Pointy sticks and ASML's EUV machines were designed by roughly the same lumps of compute-fat :)

billti · a month ago
Isn't that kinda why we have collaboration and get in room with colleagues to discuss ideas? i.e., thinking about different ideas, getting different perspectives, considering trade-offs in various approaches, etc. results in a better solution than just letting one person go off and try to solve it with their thoughts alone.

Not sure if that's a good parallel, but seems plausible.

cfn · a month ago
Maybe this is the dawn of the multicore era for LLMs.
qoez · a month ago
It's basically a mixture of experts but instead of a learned operator picking the predicted best model, you use a 'max' operator across all experts.
simondotau · a month ago
You could argue that many aspects of human cognition are "hacks" too.
Voloskaya · a month ago
> Expensive and slow

Yes, but... in order to train your next SotA model you have to do this anyway and do rejection sampling to generate good synthetic data.

So if you can do it in prod for users paying $300/month, it's a pretty good deal.

daniel_iversen · a month ago
Very clever, thanks for mentioning this!
simianwords · a month ago
that's how o3 pro also works IMO
bobjordan · a month ago
I can’t help but call out that o1-pro was great, it rarely took more than five minutes and I was almost never dissatisfied with the results per the wait. I happily paid for o1-pro the entire time it was available. Now, o3-pro is a relative disaster, often taking over 20 minutes just to refuse to follow directions and gaslight people about files being available for download that don’t exist, or provide simplified answers after waiting 20 minutes. It’s worse than useless when it actively wastes users time. I don’t see myself ever trusting OpenAI again after this “pro” subscription fiasco. To go from a great model to then just take it away and force an objectively terrible replacement, is definitely going the wrong way, when everyone else is improving (Gemini 2.5, Claude code with opus, etc). I can’t believe meta would pay a premium to poach the OpenAI people responsible for this severe regression.
zone411 · a month ago
This is the speculation, but then it wouldn't have to take much longer to answer than o3.
tibbar · a month ago
Interesting. I'd guess this technique should probably work with any SOTA model in an agentic tool loop. Fun!
JKCalhoun · a month ago
> I'm genuinely looking forward to trying this out.

Myself, I'm looking forward to trying it out when companies with less, um, baggage implement the same. (I have principles I try to maintain.)

Deleted Comment

nisegami · a month ago
I've suspected that technique could work on mitigating hallucinations, where other agents could call bullshit on a made up source.
sidibe · a month ago
You are making the mistake of taking one of Elon's presentations at face value.
tibbar · a month ago
I mean, either they cheated on evals à la Llama 4, or they have a paradigm that's currently best in class in at least a few standard evals. Both alternatives are possible, I suppose.

Dead Comment

einrealist · a month ago
So the progress is basically to brute force even more?

We went from "single prompt, single output", to reasoning (simple brute-forcing), and now to multiple parallel instances of reasoning (distributed brute-forcing)?

No wonder the prices are increasing and capacity is more limited.

Impressive. /s

Deleted Comment

andreygrehov · a month ago
I just tried Grok 4 and it's insanely good. I was able to generate 1,000 lines of Java CDK code responsible for setting up an EC2 instance with certain pre-installed software. Grok produced all the code in one iteration. 1,000 lines of code, including VPC, Security Groups, etc. Zero syntax errors! Most importantly, it generated userData (#!/bin/bash commands) with accurate `wget` pointing to valid URLs of the latest software artifacts on GitHub. Insane!
sudo-i · a month ago
The problem is that code as a one-off is excellent, but as a maintainable piece of code that needs to be in source control, shared across teams, follow a standard SDLC, be immutable, and track changes in some state - it's just not there.

If an intern handed me code like this to deploy an EC2 instance in production, I would need to have a long discussion about their decisions.

mellosouls · a month ago
How do you know without seeing the code?

How do you know the criteria you mention haven't been (or can't be) factored into the prompt and context tuning?

How do you know that all the criteria that were important in the pre-LLM world still have the same priority as LLM capabilities increase?

nlarew · a month ago
How do you know? Have you seen the code GP generated?
kvirani · a month ago
But isn't that just a few refactoring prompts away?
nashadelic · a month ago
I'd love to hear how grok works inside agentic coders like cursor or copilot for production code bases.
doctoboggan · a month ago
Please share your result if possible. So many lines in a single shot with no errors would indeed be impressive. Does grok run tools for these sorts of queries? (linters/sandbox execution/web search)
makestuff · a month ago
Out of curiosity, why do you use Java instead of typescript for CDK? Just to keep everything in one language?
oblio · a month ago
Why not, I would say? What's the advantage of using Typescript over modern Java?
z7 · a month ago
"Grok 4 (Thinking) achieves new SOTA on ARC-AGI-2 with 15.9%."

"This nearly doubles the previous commercial SOTA and tops the current Kaggle competition SOTA."

https://x.com/arcprize/status/1943168950763950555

SilverSlash · a month ago
The "heavy" model is $300/month. These prices seem to keep increasing while we were promised they'll keep decreasing. It feels like a lot of these companies do not have enough GPUs which is a problem Google likely does not have.

I can already use Gemini 2.5 Pro for free in AI studio. Crazier still, I can even set the thinking budget to a whopping 32k and still not pay a dime. Maybe Gemini 3.0 will be available for free as well.

brookst · a month ago
Who promised that there would be no advanced models with high costs?

Prices for the same number of tokens at a given level of capability are falling. But Moore's law most certainly did NOT say that chips would get no more complex than the 1103 1Kb DRAM; it said that a given amount of logic would shrink from 10mm^2 to a speck far too small to see.

serbuvlad · a month ago
> These prices seem to keep increasing while we were promised they'll keep decreasing.

A Ferrari is more expensive than the model T.

The most expensive computer is a lot more expensive than the first PC.

The price that usually falls is:

* The entry level.
* The same performance over time.

But the _price range_ gets wider. That's fine. That's a sign of maturity.

The only difference this time is that the entry level was artificially 0 (or very low) because of VC funding.

PaulHoule · a month ago
But where is the value?

If it could write like George Will or Thomas Sowell or Fred Hayek or even William Loeb that would be one thing. But it hears dog whistles and barks which makes it a dog. Except a real dog is soft and has a warm breath, knows your scent, is genuinely happy when you come home and will take a chomp out of the leg of anyone who invades your home at night.

We are also getting this kind of discussion

https://news.ycombinator.com/item?id=44502981

where Grok exhibited the kind of behavior that puts "degenerate" in "degenerate behavior". Why do people expect anything more? Ten years ago you could be a conservative with a conscience -- now if you are you start The Bulwark.

HWR_14 · a month ago
> The most expensive computer is a lot more expensive than the first PC.

Not if you're only looking at modern PCs (and adjusting for inflation). It seems unfair to compare a computer built for a data center with tens of thousands in GPUs to a PC from back then as opposed to a mainframe.

mkl · a month ago
> The most expensive computer is a lot more expensive than the first PC.

Depends on your definition of "computer". If you mean the most expensive modern PC I think you're way off. From https://en.wikipedia.org/wiki/Xerox_Alto: "The Xerox Alto [...] is considered one of the first workstations or personal computers", "Introductory price US$32,000 (equivalent to $139,000 in 2024)".

827a · a month ago
The base model Apple II cost ~$1300USD when it was released; that's ~$7000USD today inflation adjusted.

In other words, Apple sells one base-model computer today that is more expensive than the Apple II; the Mac Pro. They sell a dozen other computers that are significantly cheaper.

johnnyanmac · a month ago
That was the most predictable outcome. It's like we learned nothing from Netflix or the general enshittification of tech at the end of the 2010s. The billionaire AI companies will capture markets and charge enterprise prices to pay back investors. Then maybe we'll have a few free/cheap models fighting over the scraps.

Those small creators hoping to leverage AI to bring their visions to life for less than their grocery bill will have a rude awakening. That's why I never liked the argument of "but it saves me money on hiring real people".

I heard some small Chinese shops for mobile games were already having this problem in recent years and had to re-hire their human labor when costs started rising.

altbdoor · a month ago
It's important to note that pricing for Gemini has been increasing too.

https://news.ycombinator.com/item?id=44457371

Workaccount2 · a month ago
I'm honestly impressed that the sutro team could write a whole post complaining about Flash, and not once mention that Flash was actually 2 different models, and even go further to compare the price of Flash non-thinking to Flash Thinking. The team is either scarily incompetent, or purposely misleading.

Google replaced flash non-thinking with Flash-lite. It rebalanced the cost of flash thinking.

CamperBob2 · a month ago
Also important to note that Gemini has gotten a lot slower, just over the past few weeks.
Havoc · a month ago
It’s the inference-time scaling - this is going to create a whole new level of haves vs. have-nots split.

The vast majority of the world can’t afford 100s of dollars a month

johnb231 · a month ago
That is for professional or commercial use, not casual home users.
pzo · a month ago
also their API pricing is a little misleading - it matches Sonnet 4 pricing ($3/$15) only "for requests under 128k" (whatever that means), but above that it's 2x more.
vessenes · a month ago
That 128k is a reference to the context window — how many tokens you put in to the start. Presumably Grok 4 with 128k context window is running on less hardware (it needs much less RAM than 256k) and they route it accordingly internally.
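Under the tiering described in this exchange ($3/$15 per million input/output tokens below 128k context, double that above - figures quoted from the comments, plus the assumption that the tier keys off input length), a rough cost estimate looks like:

```python
def grok4_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Rough cost estimate from the figures quoted in this thread:
    $3/$15 per million input/output tokens under 128k context,
    double that above (assumed: the tier keys off input length)."""
    doubled = input_tokens > 128_000
    in_rate, out_rate = (6.0, 30.0) if doubled else (3.0, 15.0)
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

print(round(grok4_cost_usd(100_000, 2_000), 2))  # -> 0.33
print(round(grok4_cost_usd(200_000, 2_000), 2))  # -> 1.26
```

Note how crossing the 128k boundary roughly quadruples the bill for a doubled input, since both the token count and the rate go up.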
XCSme · a month ago
> These prices seem to keep increasing

Well, valuations keep increasing, they have to make the calculations work somehow.

worldsavior · a month ago
Why is the number of GPUs the problem and not the amount of GPU usage? I don't think buying GPUs is the problem, but running tons of GPUs can be very expensive. I presume that's the reason it's so expensive, especially with LLMs.
dragonwriter · a month ago
> These prices seem to keep increasing while we were promised they'll keep decreasing

I don't remember anyone promising that, but whoever promised you that frontier model pricing would be monotonically decreasing, over some period of time that includes our present, was either lying or badly misguided. While there will be short-term deviations, the overall arc for that will continue to be upward.

OTOH, the models available at any given price point will also radically improve, to the point where you can follow a curve of both increasing quality and decreasing price, so long as you don't want a model at the quality frontier.

42lux · a month ago
It's because a lot of the advancements are in post-training; the models themselves have stagnated. Look at the heavy "model"...
oblio · a month ago
> These prices seem to keep increasing while we were promised they'll keep decreasing.

Aren't they all still losing money, regardless?

briandw · a month ago
O3 was just reduced in price by 80%. Grok 4 is a pretty good deal for having just been released and being so much better. The token price is the same as Grok 3 for the non-heavy model. Google is losing money to try to gain relevance. I guess I'm not sure what your point is?
v5v3 · a month ago
You have to have a high RRP to negotiate any volume deals down from.

Like the other AI companies, they will want to sign up companies.

ignoramous · a month ago
> Gemini 2.5 Pro for free ...

It is Google. So, I'd pay attention to data collection feeding back in to training or evaluation.

https://news.ycombinator.com/item?id=44379036

lifthrasiir · a month ago
While Google is so explicit about that, I have a good reason to believe that this actually happens in most if not all massive LLM services. I think Google's free offerings are more about vendor lock-in, a common Google tactic.
ljlolel · a month ago
More of an issue of market share than # of gpus?
sim7c00 · a month ago
money money money, its a rich mans world...
greatpostman · a month ago
$300 a month is cheap for what is basically a junior engineer
FirmwareBurner · a month ago
Not a junior engineer in a developed country, but what was previously an offshore junior engineer tasked with doing the repetitive labor too costly for western labor.
handfuloflight · a month ago
It's a senior engineer when maneuvered by a senior engineer.
rpozarickij · a month ago
Grok's updated voice mode is indeed impressive. I wish there was a way to disable automatic turn detection, so that it wouldn't treat silence as an end of the response. I like Claude's approach (you need to tap in order to end the response), but it's not very reliable because sometimes it just abruptly cuts my response without waiting until I tap.

I was pleasantly surprised that Grok even supports (to some degree) Lithuanian in voice mode, which is a quite niche language. Grok's responses themselves are alright, but ChatGPT and Gemini way surpass it in speech recognition and speech synthesis.

pbmonster · a month ago
> Grok's updated voice mode is indeed impressive. I wish there was a way to disable automatic turn detection, so that it wouldn't treat silence as an end of the response.

You can circumvent that by instructing the model to use "radio etiquette" - only respond after the other party says "over". It will still be compelled to answer when it detects silence, you can't prevent that, but you can instruct it to only reply with a short "mhm" until you say "over". Feels very natural.

Like most models I've used with this old hack, it will immediately start role-playing and also end its own responses with "over".
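A minimal system-prompt sketch of the "radio etiquette" trick described above (the wording is illustrative, not an official or tested prompt):

```
You and the user follow radio etiquette. The user's turn ends only when
they say "over". If you detect silence before hearing "over", reply with
a short "mhm" and nothing else. End your own complete replies with "over".
```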

rpozarickij · a month ago
This is such a cool idea. I wonder whether it's possible to define a custom Personality in Grok's voice settings that would do this. Unfortunately I'm not able to create a new Personality in Grok's settings to test this right now on my phone (iPhone 15 Pro Max), because the Personality creation screen closes immediately after opening it. Might be a bug or some other issue.
nashadelic · a month ago
this is such a great, obvious(?) idea. I've always hated feeling "rushed" whenever I talk to a voice agent that doesn't give me enough time to think.
pzo · a month ago
yes their voice mode is pretty good, also works with Polish (much better than a few months ago). I wish they also had a 'push to talk' option (walkie-talkie style with a big button), similar to how Perplexity lets you choose such a mode or 'automatic'.

Also would be great if they added voice mode in browser (again like perplexity).

rpozarickij · a month ago
> Also would be great if they added voice mode in browser

There seems to be a voice mode button in the prompt input box at ~29:00 of the Grok 4 announcement video. So perhaps they're working on this, but it's hidden from the public.

stormfather · a month ago
I find for auto turn detection, models work better if you put in the system prompt "if it seems the user hasn't completed their thought yet, output silence". This hack works around their compulsive need to output something.
bfelbo · a month ago
Great hack, thanks for sharing! Any other hacks like this you’ve found useful to improve voice AI?
bilsbie · a month ago
Even better if you can just use umm’s like in a human conversation.
fdsjgfklsfd · a month ago
I feel like they should train a dumb model that does nothing but recognize when someone has finished talking, and use that to determine when to stop listening and start responding. Maybe it could even run on the phone?
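Small trained endpointing models like this do exist and can run on-device; a toy sketch of the simplest possible version, a trailing-silence detector over audio frames (the threshold and frame count are made-up values for illustration):

```python
def end_of_turn(frames, silence_threshold=0.01, trailing_silent_frames=8):
    """Toy endpointing: declare the user's turn over after N consecutive
    low-energy frames. Real systems add a trained model on top of this."""
    silent_run = 0
    for i, frame in enumerate(frames):
        rms = (sum(s * s for s in frame) / len(frame)) ** 0.5
        silent_run = silent_run + 1 if rms < silence_threshold else 0
        if silent_run >= trailing_silent_frames:
            return i  # index of the frame where we stop listening
    return None  # user is still talking

speech = [[0.2, -0.2]] * 10     # loud frames
pause = [[0.001, -0.001]] * 8   # quiet frames
print(end_of_turn(speech + pause))  # -> 17
print(end_of_turn(speech))          # -> None
```

The weakness the parent comments describe is exactly this: energy-based silence detection can't tell a thinking pause from a finished thought, which is why the "output silence" prompt hack helps.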
dzhiurgis · a month ago
Lithuanian sounds so weird on ChatGPT tho, almost like my kids speak - with a sort of English accent. Regardless, it gives my parents a superpower (when it actually works hehe).
fdsjgfklsfd · a month ago
> you need to tap in order to end the response

I hope that can be turned off while driving...

raspasov · a month ago
Grok has consistently been one of the best models I've used for deep research (no API use). Grok 4 looks even more promising.
spaceman_2020 · a month ago
Grok's Twitter integration has legitimately been one of the best use cases I've seen. Just being able to ask Grok right within the tweet about context or meaning of any jargon is very useful.
saagarjha · a month ago
@grok is this true?
LorenDB · a month ago
I think the Grok button that is present on tweets is the best way to ask Grok about tweets. Tagging @grok just spams others' timelines with useless AI responses. The Grok button lets you keep it private.
archagon · a month ago
Particularly useful if you’re an antisemite or white supremacist, it seems.
dzhiurgis · a month ago
It still struggles to grok large threads.

Hope FB brings something like this tho. Might be especially useful to summarize/search big groups.

People used to cry about how private groups and Slack killed forums and hid info, but I think we have a chance with tools like this.

v5v3 · a month ago
@AskPerplexity is also on x
CSMastermind · a month ago
I'm surprised by this, OpenAI does much better for me than all the competitors (though I wouldn't consider it good).

The only two areas I've found Grok to be the best at are real time updates and IT support questions.

FirmwareBurner · a month ago
> deep research

Can you say what you mean by deep research?

repsak · a month ago
Agent that browses the web, analyzes information, and creates reports. Grok calls it DeepSearch. Similar to gemini/openai deep research.

https://x.ai/news/grok-3#grok-agents-combining-reasoning-and...

nashadelic · a month ago
It's an agentic research mode, grounded with links from the web that reduce or eliminate hallucinations. The result is a very detailed, sometimes 50-page output. Very useful if you're trying to understand a new industry, the state of the art of a tech, etc.
lexandstuff · a month ago
Out of interest, has anyone ever integrated with Grok? I've done so many LLM integrations in the last few years, but never heard of anyone choosing Grok. I feel like they are going to need an unmistakably capable model before anyone would want to risk it - they don't behave like a serious company.
47thpresident · a month ago
Grok 3 is on Azure AI Foundry [0] and announced an integration with Telegram, albeit they are paying Telegram $300m, not vice versa [1]. But I agree, choosing Grok is just a huge reputational liability for any serious work.

[0] https://devblogs.microsoft.com/foundry/announcing-grok-3-and... [1] https://www.bbc.co.uk/news/articles/cdxvr3n7wlxo

thebigspacefuck · a month ago
Any plans for GCP Vertex AI or AWS Bedrock? Apparently Grok 3 had the highest score for Golang on roocode.com/evals, so I'd like to try it for coding. The free tier app hasn't been bad either; I like its attitude a bit better than ChatGPT's.
Workaccount2 · a month ago
I'm more curious where Grok gets talent from.

There is so much money and so many top labs falling over themselves to attract good talent, that at this point people have to be leaning on ideological goals to choose their employer.

Are there really that many AI researchers who want to make Elon god-emperor?

qoez · a month ago
I read the last election and other signals as indicating that there's way more unspoken diversity of thought in people's minds than what people feel safe to say. Secretly, a lot of top talent probably doesn't care or even aligns with Elon, but chooses to say so at most with their actions, in the form of being OK working for him.
hbn · a month ago
A lot of serious engineers would love to work in an environment that isn't the HR-reigning office politics bullshit standard of the past decade or two.

I don't even really like Elon but I bet the engineers at X are having a better time in their day-to-day than the ones at Meta or Google where all their work is constantly roadblocked by red tape, in-fighting, and PMs whose only goal is to make it look like they headed something important to get themselves promoted. Elon's at least got a vision and keeps it a top priority to be competitive in the AI space.

nashadelic · a month ago
I also feel Elon's team has been "untouchable" for Zuck, who doesn't want to stir anything up with him. But since Elon's fall from grace with the admin, that could change?
dotnet00 · a month ago
If you're focusing on ideology, it isn't like the other companies are all that good. With Sam Altman you're still working for a pathological liar with delusions of grandeur. With Google and Meta you're propping up a massive worldwide surveillance apparatus.

Tech-bros have been propping up agents/propagators of some of the biggest social ills of the past ~2 decades, xAI isn't all that different.

brcmthrowaway · a month ago
He must be paying them millions
sergiotapia · a month ago
I am using Grok to visually analyze food images. Works really well, recognizes brands and weird shots users send me. API really easy to use.
hersko · a month ago
You would have to be insane to integrate the model that last week called itself "Mecha Hitler" into your live product.

As a huge Musk fan, I'll be the first to point out that he's doing exactly what he accused Sama of doing: making powerful AI with an obvious lack of control or effective alignment.

Dead Comment