I still think GPT-5 is really a cost-cutting measure from a company that is trying to grow to 1 billion users on a product that needs GPUs.
I don't see anyone talking about GPT-5 Pro, which I personally tested against:
- Grok 4 Heavy
- Opus 4.1
It was far better than both of those, and is completely state of the art.
The real story is that running these models at their true performance ceiling would likely cost thousands of dollars per month per user. And so we are being constrained. OpenAI isn't going for that market segment; they are going for growth to take on Google.
This article doesn't have one reference to the Pro model, which completely invalidates this guy's opinion.
One of the things that I have realised is that, right now, it's a bad idea to buy a subscription for any of the models.
The offerings are evolving and upgrading at quite a rapid pace, so locking into one company's offering or another's is really wasted money. Why pay $200/year upfront for something that looks like it will be outdated within the next month (or quarter)?
> The real story is running these models at true performance max likely could go into the thousands per month per user.
A loss-leader model like that failed for Uber, because there really weren't any other constraints on competitors doing the same, including underpricing to capture market share - meaning it's a race to the bottom plus a test of whose pockets are the deepest.
I can definitely understand that it feels like looking at early adopters of technology with hindsight. Calculators, home/office computers, media players.
BUT, of course, if you can get more value out of it inside a month or year than you put in, it doesn't really matter. Differences between frontier models feel pretty low to me currently, which was not always the case; even so, they're certainly far ahead of free models for my uses.
I did pay for OpenAI both in a $20/mo subscription and later for API tokens (came out cheaper than the subscription). Since Gemini 2.5 came out, though, I just abuse Google's free AI Studio (at least free in the US; idk what kind of geo-gating they do). I'm not paying Google, but they are keeping me from paying their competitors. It will take a large and hard-gated leap forward (or Google deciding my $150-250 worth of unpaid use has gone on long enough) to get me to pay again.
AI pricing is stuck on subscription rather than metering, which means it's a race to the bottom. It's not obvious how AI providers can change that, as the service they offer is just a game to users, who can, even reluctantly, switch off.
I don't think that GPT-5 Pro is much better (if better at all) than o3-pro. It's markedly slower. Output quality is comparable. It's still quite gullible and misses the point sometimes. It does seem better, however slightly, at suggesting novel approaches to problem solving. My initial impressions are that 5-pro is maybe 0-2% more knowledgeable and 5-10% more inventive/original than o3-pro. The "tone" and character of the models feel exactly the same.
I'll agree that it's superhuman and state-of-the-art at certain tasks: Formal logic, data analysis, and basically short analytical tasks in general. It's better than any version of Grok or Gemini.
When it comes to writing prose, and generally functioning as a writing bot, it's a poor model, obviously and transparently worse than Kimi K2 and Deepseek R1. (It never ceases to amaze me that the best English prose stylists are the Chinese models. It's not just that they don't write in GPT's famous "AI style," it's to the point where Kimi is actually on par with most published poets.)
I think it is; I've been using these models for 6 hours a day for almost a year. At any given time I have two of the max subscriptions (right now Grok and OpenAI).
I had a bug that was a complex interaction between the backend and frontend over websockets. The day before, I was banging my head against the wall with o3-pro and Grok Heavy; GPT-5 solved it on the first try.
I think it's also true that most people aren't pushing the limits of the models, and don't even really know how to correctly push the limits. Which is also why OpenAI is not focused on the best models.
Stylized prose is where Claude 3 Opus particularly shines due to its character training and multilingual performance. It's plagued with claudeisms and has a ton of other shortcomings, but it's still better than any current model at this, including K2, R1, and especially Claude 4. Too bad Anthropic basically reversed their direction on creative writing, despite reporting some improvements each time (which don't seem to be true in practice).
I agree here, but I also believe it was a way to expose better models to the masses. o3 was so spectacularly good, but a lot of people were still not using it. Even with friends who use ChatGPT daily, I would ask "are you using o3?" and get a blank stare.
So I think it’s also a way to push reasoning models to the masses. Which increases OpenAI’s cost.
But due to the routing layer it's definitely cost cutting for power users (most of HN)… except power users can learn to force it to use the reasoning model.
I don't think Pro is usable via the API, otherwise I'd be testing it. Is it usable through Codex CLI, given they updated that to be able to use your subscription?
Agreed. Something else that might be driving this is that existing models essentially get the job done for most users, who — unlike HN commenters (I promise this is human generated, em dash notwithstanding ;P) — don't care much about the state of the art.
Does Pro fix the fundamental issues he describes? I would think it would have to do that to "completely invalidate his opinion", rather than just be better than the base model.
This is a genre of article I find particularly annoying. Instead of writing an essay on why he personally thinks GPT-5 is bad based on his own analysis, the author just gathers up a bunch of social media reactions and tells us about them, characterizing every criticism as “devastating” or a “slam”, and then hopes that the combined weight of these overtorqued summaries will convince us to see things his way.
It's too slanted to be journalism, yet not original enough to be analysis.
For some reason AI seems to bring out articles that seem to fundamentally lack curiosity - opting instead for gleeful mockery and scorn. I like AI, but I'll happily read thoughtful articles from people who disagree. But not this. This article has no value other than to dunk on the opposition.
I tend to think HN's moderation is OK, but I think these sorts of low-curiosity articles need to be off the front page.
> For some reason AI seems to bring out articles that seem to fundamentally lack curiosity - opting instead for gleeful mockery and scorn
I think it's broader than AI - it applies to all tech. It all started in 2016, after it was deemed that tech, especially social media, had helped sway the election. Since then a lot of things became political that weren't in the past, and tech got swept up w/ that. And unfortunately AI has its haters, despite the fact that it's objectively the fastest-growing, most exciting technology in the last 50 years. Instead they're dissecting some CEO's shitposts.
Fast forward to today, pretty much everything is political. Take this banger from NY Times:
> Mr. Kennedy has singled out Froot Loops as an example of a product with too many ingredients. In an interview with MSNBC on Nov. 6, he questioned the overall ingredient count: “Why do we have Froot Loops in this country that have 18 or 19 ingredients and you go to Canada and it has two or three?” Mr. Kennedy asked.
> He was wrong on the ingredient count, they are roughly the same. But the Canadian version does have natural colorings made from blueberries and carrots while the U.S. product contains red dye 40, yellow 5 and blue 1 as well as Butylated hydroxytoluene, or BHT, a lab-made chemical that is used “for freshness,” according to the ingredient label.
I don't like AI and I think this type of article is very boring. Imagine having one of the most interesting technological developments of the last 50 years unfolding before your eyes and resorting to reposting tweet fragments…
I think this is the fundamental trend of all "commentary" in the digital age.
Thoughtful, nuanced takes simply cannot generate the same audience and velocity, and by the time you write something of actual substance the moment is gone and the hyper-narrative has moved on.
This is well earned by the likes of OpenAI, which is trying to convince everyone they need trillions of dollars to build fabs to build super-genius AIs. These super-genius AIs will replace everyone (except billionaires) and act as magic money printers (for billionaires).
Meanwhile their super genius precursor AIs make up shit and can't count letters in words while being laughably sycophantic.
There's no need to defend poor innocent megacorps trying to usher in a techno-feudal dystopia.
It's a blog post about whether GPT-5 lived up to the hype and how it is being received, which is a totally legitimate thing to blog about. This is Gary Marcus's blog, not BBC coverage; of course it's slanted toward the opinion he is trying to express.
Yeah which is exactly what the post you’re responding to is commenting on.
It’s a classic HN comment asking for nuance and discrediting Gary. It’s about how Gary is always following classic mob mentality, so of course it’s not slanted at all and commenting about the accuracies of the post.
So ironically you’re saying Gary’s shit is supposed to be that way and you’re criticizing the HN comment for that, but now I’m criticizing you for criticizing the comment because HN comments ARE supposed to be the opposite of Gary’s bullshit opinions.
I expect to read better stuff on HN. Not this type of biased social media violence and character take downs.
Sure, that's fine, but the question is whether or not that's interesting to the HN crowd. Apparently it is, as it made it to the front page. But I agree with GP's criticism of the article; if I wanted to know what reddit thought about GPT-5, I'd go to reddit. I don't need to read an article filled with gleeful vitriol. I gave up after a couple paragraphs.
That's not at all what Marcus is saying. He admits that it does remarkably well, but says (1) it's still not trustworthy; and (2) This version is not much better than the previous version. Both points are in support of his claim that just scaling isn't ever going to lead to General AI.
I agree. A better article would dive into the economics of why they didn't release the model that won gold in the 2025 International Math Olympiad. And the answer there is (probably) because it had a million dollars in inference compute costs just to take the test. If you follow that line of reasoning, they're onto something, possibly something that is AGI, but it's still much too expensive to commercialize.
If AI gains now come from spending OOMs more on inference than on training compute, it means we're in slow-takeoff-istan and they're going to need to keep the hype train going for a long time to keep investment up.
Can they get there? Is it a matter of waiting out the clock on Moore's law? Or are they on a sigmoid curve with inference-based gains as well?
That's the question of our time and it's completely ignored.
I think it's a broad problem across every aspect of life - it has gotten increasingly more difficult to find genuine takes. Most people online seem to be just relaying a version of someone else's take and we end up with unnecessarily loud and high volume shallow content.
To be fair, Gary Marcus pioneered the "LLMs will never make it" genre of complaining. Everyone else is derivative [1]. Let the man have his victory lap. He's been losing arguments for 5 years straight at this point.
[1] Due credit to Yann for his 'LLMs will stop scaling, energy based methods are the way forward' obsession.
100% agree. I feel like this is a symptom of Dead Internet Theory as well - as a negative take starts to spiral out of control, we get an absolute deluge of repurposed, directionally negative sound bites, and it honestly feels like bot canvassing.
I think, from the author's perspective, LLM hype has been mostly the exact same thing you're accusing him of doing. People with very little technical background claiming AGI is near, or all these CEOs pushing nonsense narratives - it's getting old. People are blindly trusting these people and offloading all their thinking to a sophisticated stochastic machine. It's useful, yes; super cool, yes. Some super god-like power or something that brings us to AGI? No, probably not. I can't blame him. I am sick of the hype. Grifters are coming out of the woodwork in a field with too many grifters to begin with. All these AI/LLM companies are high on their own supply and it's getting old.
> That’s exactly what it means to hit a wall, and exactly the particular set of obstacles I described in my most notorious (and prescient) paper, in 2022. Real progress on some dimensions, but stuck in place on others.
The author includes their personal experience — recommend reading to the end.
I did read to the end before commenting. The author alludes to a paper they wrote 3 years ago while self-importantly complimenting themself on how good it was (always a red flag). They don’t really say much other than that in the post.
I don't think the complaint is ultimately against the post - if someone wants to post whatever on their blog, that is fine. The complaint is more targeted at the people upvoting it, because... it is hard to speculate what their motivations are, but their ability to spot a low-content article when they see it is limited.
Any journalism (or anything that resembles it) which contains the words "devastating", "slam", or the many equivalents is garbage. Unless it's about a natural disaster or professional wrestling.
99% of content about AI nowadays is smug bullshitting with no real value. Is that new?
I'm eagerly awaiting more insightful analyses about how AI will create a perpetuum mobile of wealth for all of humanity, independent from natural resource consumption and human greed. It seems so logical!
Even some lesswrong discussions, however confused, are more insightful than most blog posts by AI company customers.
I will never understand the "bad diagram" kind of critique. Yes, maybe it can't build and label a perfect image of a bicycle, but could it list and explain the major components of a bike? Schematics are a whole different skill, and do we need to remind everyone what the L in LLM stands for?
Listing and explaining is essentially repeating what someone else has said somewhere. Labeling a schematic requires understanding what you're saying (or copy-pasting a schematic, so I guess we can be happy that GPT-5 doesn't do that). No one who actually understood the function of the major components would mislabel the schematic like that, unless they were blind.
I did get a strong sense of jilted nerd. Why didn't they give ME those billions in research funding? Nobody sees that I am the smartest boy because they're just a bunch of dopes. Opinion people are something I think we could all do with less of.
We as a community have decided to absolutely drench the front page with low-effort hot takes by non-practitioners about one of many areas of modern neural network ... progress.
This low-effort hot take is every bit as "valid" as all the nepobaby vibecode hype garbage: we decided to do the AI thing. This is the AI thing.
What's your point? This one is on the critical side of the argument that was stupid in toto to begin with?
Here are my reasons why this "upgrade" is, in experience, a huge downgrade for Plus users:
* The quality of responses from GPT-5 compared to o3 is lacking. It does very few rounds of thinking and doesn't use web search the way o3 used to. I've tried selecting "thinking", instructing explicitly; nothing helps. For now, I have to use Gemini to get similar quality of outputs.
* Somehow, custom GPTs [1] are now broken as well. My custom grammar-checking GPT is ignoring all instructions, regardless of the selected model.
* Deep research (I'm well within the limit still) is broken. Selecting it as an option doesn't help, the model just keeps responding as usual, even if it's explicitly instructed to use deep research.
Projects seem broken as well. Does not follow instructions, talks in Spanish, completely ignores my questions, and sometimes appears to be having a conversation with itself while ignoring everything I say. I even typed random key presses and it just kept on giving me the same unwanted answer, sometimes in Spanish.
The AI community requires more independent experts like Marcus to maintain integrity and transparency, ensuring that the field does not succumb to hyperbole as well as shifting standards such as "internally achieved AGI", etc.
Regardless of personal opinions about his style, Marcus has been proven correct on several fronts, including the diminishing returns of scaling laws and the lack of true reasoning (out of distribution generalizability) in LLM-type AI.
These are issues that the industry initially denied, only to (years) later acknowledge them as their "own recent discoveries" as soon as they had something new to sell (chain-of-thought approach, RL-based LLM, tbc.).
Agreed, the hype cycles need vocal critics. The loudest voices talking about LLMs are the ones who financially benefit the most from it. I'm not anti-AI; I think the hype, and gaslighting the entire economy into believing this is the sole thing that is going to render them unemployed, is ridiculous (the economy is rough for a myriad of other reasons, most of which originate from our country's choice in leadership).
Hopefully the innovation slowing means that all the products I use will move past trying to duct-tape AI on and start working on actual features/bugs again.
I have a tiny tiny podcast with a friend where we try to break down what parts of the hype are bullshit (muck) and which kernels of truth are there, if any. We started it partially as a place to scream into the void, partially to help the people who are anxious about AGI or otherwise being harmed by the hype. I think we have a long way to go in terms of presentation (breaking down very technical terms for an audience that is used to vague hype around "AI" is hard), but we cite our sources - maybe it'll be interesting for you to check out our shownotes.
I personally struggle with Gary Marcus critiques because whenever they are about "making AI work" they go into neurosymbolic "AI", which I have technical disagreements with, and I have _other_ arguments for the points he sometimes raises which I think are more rigorous, so it's difficult to be roughly in the same camp - but overall I'm happy someone with reach is calling BS as well.
Hard disagree. The essay is a rehash of Reddit complaints, with no direct results from testing, and is largely about product-launch snafus (a simultaneous launch to 500M+ users, mind you). Please.
I think most hit pieces like this miss what is actually important about the 5 launch - it’s the first product launch in the space. We are moving on from model improvements to a concept of what a full product might look like. The things that matter about 5 are not thinking strength, although it is moderately better than o3 in my tests, which is roughly what the benchmarks say.
What's important is that it's faster, that it's integrated, that it's set up to provide incremental improvements (to, say, multimodal interaction, image generation and so on) without needing the branding of a new model, and I think the very largest improvement is its ability to retain context and goals over a very long set of tool uses.
Willison mentioned it's his only daily driver now (for a largely coding-based usage setup), and I would say it's significantly better at getting through a larger / longer / more context-heavy coding task than the prior best - Claude - or the prior best architects (o3-pro or Gemini, depending). It's also much faster than o3-pro for coding.
Anyway, saying “Reddit users who have formed parasocial relationships with 4o didn’t like this launch -> oAI is doomed” is weak analysis, and pointless.
If ChatGPT 5 lived up to the hype, literally no one would be asking for old models back. The snafus are minor as far as presentations go, but their existence completely undermines the product OpenAI is selling, which is an expert in your pocket. They showed everyone this "expert" can't even assist the creators themselves to nail such a high stakes presentation; OpenAI's embarrassing oversights foretell similar embarrassments for anyone who relies on this product for their high stakes presentation or report.
At this point the single biggest improvement that could be made to GPTs is making them able to say "I don't know" when they honestly don't.
Just today I was playing around with modding Cyberpunk 2077 and was looking for a way to programmatically spawn NPCs in redscript. It was hard to figure out, but I managed. ChatGPT 5 just hallucinated some APIs even after doing "research" and repeatedly being called out.
After 30 minutes of ChatGPT wasting my time I accepted that I'm on my own. It could've been 1 minute.
Don't make the mistake of thinking that "knowing" has anything to do with the output of ChatGPT. It gives you the statistically most likely output based on its training data. It's not checking some sort of internal knowledge system, it's literally just outputting statistical linguistic patterns. This technology can be trained to emphasize certain ideas (like propaganda) but it can not be used directly to access knowledge.
> It's not checking some sort of internal knowledge system
In my case it was consuming online sources, then repeating "information" not actually contained therein. This, at least, is absolutely preventable even without any metacognition to speak of.
Yep, that's a great point. They often feel like a co-worker who speaks with such complete authority on a subject that you don't even consider alternatives, until you realise they are lying. Extremely frustrating.
> At this point the single biggest improvement that could be made to GPTs is making them able to say "I don't know" when they honestly don't.
You're not alone in thinking this. And I'm sure this has been considered within the frontier AI labs and surely has been tried. The fact that it's so uncommon must mean something about what these models are capable of, right?
Yes, there are people working on this, but not as many as one would like. GPTs have uncertainty baked into them, but the problem is that it's for the next-token prediction task and not for the response as a whole.
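To make that distinction concrete, here is a minimal Python sketch (my illustration, not from any commenter; the logprob values are invented) of why per-token uncertainty doesn't give you whole-response confidence: averaging token log-probabilities mostly measures fluency, so a fluent hallucination can still score as "confident".

```python
import math

def sequence_confidence(token_logprobs):
    """Crude whole-response score from per-token log-probabilities.

    The model only ever scores the next token given the prefix, so averaging
    those scores (or converting to perplexity) measures fluency, not whether
    the answer as a whole is true.
    """
    avg_logprob = sum(token_logprobs) / len(token_logprobs)
    perplexity = math.exp(-avg_logprob)
    return avg_logprob, perplexity

# Hypothetical logprobs for a fluent but factually wrong answer:
# fluency keeps perplexity low even though the content is made up.
confident_nonsense = [-0.1, -0.2, -0.05, -0.3, -0.15]
print(sequence_confidence(confident_nonsense))  # low perplexity, looks "confident"
```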
> More honest responses
> Alongside improved factuality, GPT‑5 (with thinking) more honestly communicates its actions and capabilities to the user—especially for tasks which are impossible, underspecified, or missing key tools. In order to achieve a high reward during training, reasoning models may learn to lie about successfully completing a task or be overly confident about an uncertain answer. For example, to test this, we removed all the images from the prompts of the multimodal benchmark CharXiv, and found that OpenAI o3 still gave confident answers about non-existent images 86.7% of the time, compared to just 9% for GPT‑5.
> When reasoning, GPT‑5 more accurately recognizes when tasks can’t be completed and communicates its limits clearly. We evaluated deception rates on settings involving impossible coding tasks and missing multimodal assets, and found that GPT‑5 (with thinking) is less deceptive than o3 across the board. On a large set of conversations representative of real production ChatGPT traffic, we’ve reduced rates of deception from 4.8% for o3 to 2.1% of GPT‑5 reasoning responses. While this represents a meaningful improvement for users, more work remains to be done, and we’re continuing research into improving the factuality and honesty of our models. Further details can be found in the system card.
I just ran evaluations of GPT-5 for our RAG scenario and was pleasantly surprised at how often it admitted "I don't know" - more than any model I've eval'd before. Our prompt does tell it to say it doesn't know if context is missing, so that likely helped, but this is the first model to really adhere to that.
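For readers curious what such a prompt looks like, here is a hedged sketch of a RAG-style instruction; the exact wording and the `build_messages` helper are assumptions for illustration, not the commenter's actual setup.

```python
# Hypothetical system prompt; the wording is an assumption for illustration only.
SYSTEM_PROMPT = (
    "Answer strictly from the provided context. "
    "If the context does not contain the answer, reply exactly: \"I don't know.\" "
    "Do not guess and do not use outside knowledge."
)

def build_messages(context: str, question: str) -> list:
    """Assemble a chat-style request that pins the model to the retrieved context."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
```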
Not sure. In RLHF you are adjusting the weights away from wrong answers in general. So this is being done.
I think the closest you can get without more research is another model checking the answer and looking for BS. This will cripple speed but if it can be more agentic and async it may not matter.
I think people need to choose between chat interface and better answers.
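As a rough illustration of the "second model checks the answer" idea, here is a hedged asyncio sketch; `call_model` is a hypothetical stand-in for whatever client you use, and the model names are placeholders, not real endpoints.

```python
import asyncio

async def call_model(model: str, prompt: str) -> str:
    """Stand-in for your real client call (OpenAI, Anthropic, a local server)."""
    await asyncio.sleep(0)  # pretend network latency
    return f"[{model} response to: {prompt[:40]}...]"

async def answer_with_verification(question: str) -> str:
    draft = await call_model("generator-model", question)
    # A second model audits the draft; because it runs as a task, a caller
    # could keep streaming the draft to the user while the check is pending.
    verdict = await asyncio.create_task(
        call_model(
            "verifier-model",
            f"Question: {question}\nDraft answer: {draft}\n"
            "List any claims that look unsupported or hallucinated.",
        )
    )
    return f"{draft}\n\n[verifier notes]\n{verdict}"

print(asyncio.run(answer_with_verification("Some question that needs fact-checking")))
```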
I feel his need to be right distracts from the fact that he is. It's interesting to think about what a hybrid symbolic/transformer system could be. In a linked post he showed that effectively delegating math to Python is what made Grok 4 so successful at math. I'd personally like to see more of what a symbolic-first system would look like: effectively hard math, with monads for where inference is needed.
Aloe's neurosymbolic system just beat OpenAI's deep research score on the GAIA benchmark by 20 points. While Gary is full of bluster, he does know a few things about the limitations of LLMs. :) (aloe.inc)
Yeah, there was an old paper that blew math/physics benchmarks out of the water by letting the LLM write code and having the physics engine execute it. I don't have a link to it off the top of my head, but that seems to be the right direction.
LLM + general tool use seems to be quite effective.
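A toy version of that pattern, for illustration only: the model is asked to emit Python for the arithmetic, and a (very much non-production) executor runs it and returns the number instead of trusting the model's own calculation. `generate_code` is a hypothetical stand-in for the real model call.

```python
# generate_code() is a placeholder for the model call; a real system would
# prompt the LLM with something like "Write Python that computes the answer."
def generate_code(question: str) -> str:
    return "result = sum(i * i for i in range(1, 101))"  # sum of squares 1..100

def run_snippet(snippet: str):
    """Toy executor: NOT a real sandbox, just enough to show the loop."""
    namespace = {}
    exec(snippet, {"__builtins__": {"range": range, "sum": sum}}, namespace)
    return namespace.get("result")

question = "What is the sum of the squares of the first 100 positive integers?"
print(run_snippet(generate_code(question)))  # 338350, computed rather than recalled
```

The design point is just that the number comes from executed code, so the model's shaky arithmetic never enters the answer.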
There is no training data left. Every refinement in AI from here on will come from architectural changes. All models basically have reached a local maximum on new information.
You can only take that so far before model collapse. Even if it were equivalent to multiplying currently available data by four, would that be enough? Might still look like an incremental improvement.
I've also heard, secondhand, that R1 is quite clever in Chinese, too.
Could you provide some examples, please? I find this really exciting. I’ve never yet encountered an AI with good literary writing style.
And poetry is really hard, even for humans.
any feedback is greatly appreciated!!! especially comparing with o3
I remember reading that 4o was the best general purpose one, and that o3 was only good for deeper stuff like deep research.
The crappy naming never helped.
No self-awareness.
https://archive.is/dT2qK#selection-975.0-996.0
If it hurts your feelings to have your terrible opinions "dunked" on then take the time to form better opinions.
His takes often remind me of Jim Cramer’s stock analysis — to the point I’d be willing to bet on the side of a “reverse Gary Marcus”.
https://news.ycombinator.com/item?id=44278811
I think you're absolutely right about this being a wider problem though.
It's a blog post.
I prefer the fourth-power definition: if it has the power of broadcast (which this does), then it's journalism.
Fascinating technology though, sure.
[1]: https://openai.com/index/introducing-gpts/
Their business model is not to have a $20, no-ads plan in the future.
https://kairos.fm/muckraikers/
Sure, typically we don’t invent totally made up names, but we certainly do make mistakes. Our memory can be quite hazy and unreliable as well.
I mean, it's all probability, right? There must be a way to give it some score.
There's no second internet of high-quality content to plagiarise,
and the valuable information on the existing one is starting to be locked down pretty hard.
Meanwhile, the fraction of the real world that text captures is microscopic.
It seems to lose the thread of the conversation quite abruptly, not really knowing how to answer the next comment in a thread of comments.
It's like there is some context cleanup process going on and it's not summarizing the highlights of the conversation to that point.
If that is so, then it seems to also have a very small context, because it seems to happen regularly.
Adding a "Please review the recent conversation before continuing" prompt seems to help it a bit.
It feels physically jarring when it loses the plot with a conversation, like talking to someone who wasn't listening.
I'm sure it's a tuning thing; I hope they fix it soon.
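One way to approximate that workaround programmatically is to keep a rolling recap and re-inject it each turn. This is a hedged sketch with placeholder `summarize` and `chat` functions; it is not a description of how ChatGPT actually manages context.

```python
# summarize() and chat() are placeholders, not real API calls; the point is
# only the shape of the workaround: recap first, then the new message.
def summarize(history):
    return "Recap: " + " | ".join(turn[:60] for turn in history[-6:])

def chat(messages):
    return "[model reply]"  # stand-in for the real completion call

def reply_with_recap(history, user_msg):
    recap = summarize(history)
    messages = [
        {"role": "system",
         "content": "Please review the recent conversation before continuing.\n" + recap},
        {"role": "user", "content": user_msg},
    ]
    history.append(user_msg)
    return chat(messages)

history = ["We were debugging the websocket handler.", "You suggested adding a ping timeout."]
print(reply_with_recap(history, "OK, so what should the timeout value be?"))
```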