Readit News
M4v3R · 4 months ago
Ok, I’m a bit underwhelmed. I’ve asked it a fairly technical question, about a very niche topic (Final Fantasy VII reverse engineering): https://chatgpt.com/share/68001766-92c8-8004-908f-fb185b7549...

With the right knowledge and web searches one can answer this question in a matter of minutes at most. The model fumbled around modding forums and other sites and did manage to find some good information, but then started to hallucinate some details and used them in its further research. The end result it gave me was incorrect, and the steps it described to get the value were totally fabricated.

What's even worse, in the thinking trace it looks like it is aware that it does not have an answer and that the 399 is just an estimate. But in the answer itself it confidently states that it found the correct value.

Essentially, it lied to me: it doesn't really know, yet it provided me with an estimate without telling me.

Now, I'm perfectly aware that this is a very niche topic, but at this point I expect the AI to either find me a good answer or tell me it couldn't do it. Not to lie to my face.

Edit: Turns out it’s not just me: https://x.com/transluceai/status/1912552046269771985?s=46

int_19h · 4 months ago
Compare to Gemini Pro 2.5:

https://g.co/gemini/share/c8fb1c9795e4

Of note, the final step in the CoT is:

> Formulate Conclusion: Since a definitive list or count isn't readily available through standard web searches, the best approach is to: state that an exact count is difficult to ascertain from readily available online sources without direct analysis of game files ... avoid giving a specific number, as none was reliably found across multiple sources.

and then the response is in line with that.

M4v3R · 4 months ago
I like this answer. It does mention the correct, definitive way of getting the information I want (extracting the char.lgp data file) and so even though it gave up it pushes you in the right direction, whereas o3/o4 just make up stuff.
werdnapk · 4 months ago
I've used AI with "niche" programming questions and it's always a total letdown. I truly don't understand this "vibe coding" movement unless everyone is building todo apps.
SkyPuncher · 4 months ago
There's a bit of a skill to it.

Good architecture plans help. Telling it where in an existing code base it can find things to pattern match against is also fantastic.

I'll often end up with a task that looks something like this:

* Implement Foo with a relation to FooBar.

* Foo should have X, Y, Z features

* We have an existing pattern for Fidget in BigFidget. Look at that for implementation

* Make sure you account for A, B, C. Check Widget for something similar.

It works surprisingly well.

hatefulmoron · 4 months ago
It's incredible when I ask Claude 3.7 a question about TypeScript/Python and it can generate hundreds of lines of code that are pretty on point (it's usually not exactly correct on the first prompt, but it's coherent).

I've recently been asking questions about Dafny and Lean -- it's frustrating that it will completely make up syntax and features that don't exist, but still speak to me with the same confidence as when it's talking about TypeScript. It's possible that shoving lots of documentation or a book about the language into the context would help (I haven't tried), but I'm not sure if it would make up for the model's lack of "intuition" about the subject.

mikepurvis · 4 months ago
I'm trialing co-pilot in VSCode and it's a mixed bag. Certain things it pops out great, but a lot of times I'll be like woohoo! <tab> <tab> <tab> and then end up immediately realising wait a sec, none of this is actually needed, or it's just explicitly calling for things that are already default values, or whatever.

(This is particularly in the context of metadata-type stuff, things like pyproject files, ansible playbooks, Dockerfiles, etc)

chaboud · 4 months ago
I recently exclaimed that “vibe coding is BS” to one of my coworkers before explaining that I’ve actually been using GPT, Claude, llama (for airplanes), Cline, Cursor, Windsurf, and more for coding for as long as they’ve been available (more recently playing with Gemini). Cline + Sonnet 3.7 has been giving me great results on smaller projects with popular languages, and I feel truly fortunate to have AWS Bedrock on tap to drive this stuff (no effective throttling/availability limits for an individual dev). Even llama + Continue has proven workable (though it will absolutely hallucinate language features and APIs).

That said, 100% pure vibe coding is, as far as I can tell, still very much BS. The subtle ugliness that can come out of purely prompt-coded projects is truly a rat hole of hate, and results can get truly explosive when context windows saturate. Thoughtful, well-crafted architectural boundaries and protocols call for forethought and presence of mind that isn’t yet emerging from generative systems. So spend your time on that stuff and let the robots fill in the boilerplate. The edges of capability are going to keep moving/growing, but it’s already a force multiplier if you can figure out ways to operate.

For reference, I’ve used various degrees of assistance for color transforms, computer vision, CNN network training for novel data, and several hundred smaller problems. Even if I know how to solve a problem, I generally run it through 2-3 models to see how they’ll perform. Sometimes they teach me something. Sometimes they violently implode, which teaches me something else.

ecocentrik · 4 months ago
People who embrace vibe coding are probably the same people who were already pseudo-vibe coding to begin with, using found fragments of code they could piece together to make things sort of work for simple tasks.
killerdhmo · 4 months ago
I mean, I don't think you need to do cutting edge programming to make something personal to you. See here from Canva's product. Check this out: https://youtu.be/LupwvXsOQqs?t=2366
motorest · 4 months ago
> I've used AI with "niche" programming questions and it's always a total letdown.

That's perfectly fine. It just means you tried without putting in any effort and failed to get results that were aligned with your expectations.

I'm also disappointed when I can't dunk or hit >50% of my 3pt shots, but then again I never played basketball competitively.

> I truly don't understand this "vibe coding" movement unless everyone is building todo apps.

Yeah, I also don't understand the NBA. Every single one of those players shows themselves dunking and jumping over cars and hitting almost perfect percentages on 3pt shots during practice, whereas I can barely get off my chair. The problem is certainly basketball.

lend000 · 4 months ago
I imagine that after GPT-4 / o1, improvements on benchmarks have increasingly been a result of overfitting: those breakthrough models already used most of the high-quality training data available on the internet, there haven't been any dramatic architectural changes, we are already melting the world's GPUs, and there simply isn't enough new, high-quality data being generated (orders of magnitude more than what was already used on older models) to enable breakthrough improvements.

What I'd really like to see is the model development companies improving their guardrails so that they are less concerned about doing something offensive or controversial and more concerned about conveying their level of confidence in an answer, i.e. saying "I don't know" every once in a while. Once we get a couple of years of relative stagnation in AI models, I suspect this will become a huge selling point and you will start getting "defense grade", B2B-type models where accuracy is king.

siva7 · 4 months ago
It can imitate its creator. We reached AGI.
casinoplayer0 · 4 months ago
I wanted to believe. But not now.
hirvi74 · 4 months ago
Have you asked this same question to various other models out there in the wild? I am just curious if you have found some that performed better. I would ask some models myself, but I do not know the proper answer, so I would probably be gullible enough to believe whatever the various answers have in common.
shultays · 4 months ago
AIs in general definitely hallucinate a lot more when it comes to niche topics. It is funny how they are unable to say "I don't know" and just make up things to answer your questions.
felipeerias · 4 months ago
LLMs made me a lot more aware of leading questions.

Tiny changes in how you frame the same query can generate predictably different answers as the LLM tries to guess at your underlying expectations.

M4v3R · 4 months ago
Btw, I've also asked this question using Deep Research mode in ChatGPT and got the correct answer: https://chatgpt.com/share/68009a09-2778-8004-af40-4a8e7e812b...

So maybe this is just too hard for a “non-research” mode. I’m still disappointed it lied to me instead of saying it couldn’t find an answer.

shmerl · 4 months ago
How would it ever know the answer it found is true and correct, though? It could just as well repeat some existing false answer that you hadn't yet found on your own. That's not much better than hallucinating it, since you can't verify its truth without finding it independently anyway.
M4v3R · 4 months ago
I would be OK with having an answer and an explanation of how it got the answer, with a list of sources. And it does just that; the only problem is that both the answer and the explanation turn out to be fabrications once you double-check the sources.
Davidzheng · 4 months ago
Underwhelming compared with Gemini 2.5 Pro; however, it would've been impressive a month ago, I think.
tern · 4 months ago
What's the correct answer? Curious if it got it right the second time: https://chatgpt.com/share/68009f36-a068-800e-987e-e6aaf190ec...
heavyset_go · 4 months ago
Same thing happened when asking it a fairly simple question about dracut on Linux.

If I went through with the changes it suggested, I wouldn't have a bootable machine.

yMEyUyNE1 · 4 months ago
> Not to lie to my face.

Are you saying that it deliberately lied to you?

> With the right knowledge and web searches one can answer this question in a matter of minutes at most.

Reminded me of the Dunning-Kruger curve: the AI model at the first peak and you at the latter.

M4v3R · 4 months ago
> Are you saying that it deliberately lied to you?

Pretty much, yeah. Now, "deliberately" does imply some kind of agency or even consciousness, which I don't believe these models have; it's probably the result of overfitting, reward hacking, or some other issue from training, but the end result is that the model straight up misleads you knowingly (as in, the thinking trace is aware of the fact that it doesn't know the answer, but it provides one anyway).

mountainriver · 4 months ago
Oh boy, here come the "it didn't work for this one specific thing I tried" posts.
dragonmost · 4 months ago
But then how can you rely on it for things you don't know the answer to? The exercise just goes to show it still can't admit it doesn't know and lies instead.
erikw · 4 months ago
Interesting... I asked o3 for help writing a flake so I could install the latest WebStorm on NixOS (since the one in the package repo is several months old), and it looks like it actually spun up a NixOS VM, downloaded the WebStorm package, wrote the flake, calculated the SHA hash that NixOS needs, and wrote a test suite. The test suite indicates that it even did GUI testing, though I'm not sure whether that is a hallucination. Nevertheless, it one-shotted the installation instructions for me, and I don't see how it could have calculated the package hash without downloading, so I think this indicates some very interesting new capabilities. Highly impressive.
danpalmer · 4 months ago
Are you sure about all of this? You acknowledged it might be a hallucination, but you seem to mostly believe it? o3 doesn't have the ability to spin up a VM.

https://xcancel.com/TransluceAI/status/1912552046269771985 / https://news.ycombinator.com/item?id=43713502 is a discussion of these hallucinations.

As for the hash, could it have simply found a listing for the package with hashes provided and used that hash?

tymscar · 4 months ago
That's so different from my experience. I tried to have it switch a flake for a yarn package that works over to npm, and after 3 tries with all the hints I could give it, it couldn't do it.
bool3max · 4 months ago
I find that so incredibly unlikely. Granted I haven't been keeping up to date with the latest LLM developments - but has there even been any actual confirmation from OpenAI that these models have the ability to do such things in the background?
peterldowns · 4 months ago
If it can write a nixos flake it's significantly smarter than the average programmer. Certainly smarter than me, one-shotting a flake is not something I'll ever be able to do — usually takes me about thirty shots and a few minutes to cool off from how mad I am at whoever designed this fucking idiotic language. That's awesome.
ZeroTalent · 4 months ago
I was a major contributor to Flakes. What in particular is so idiotic, in your opinion?

brailsafe · 4 months ago
I mean, a smart programmer still has to learn what NixOS and flakes are, and based on your description and some cursory searching, a smart programmer would just go do literally anything else. Perfect thing to delegate to a machine that doesn't have to worry about motivation.

Just jokes, idk anything about either.

\s

ai-christianson · 4 months ago
> Interesting... I asked o3 for help writing...

What tool were you using for this?

georgewsinger · 4 months ago
Very impressive! But on arguably the most important benchmark -- SWE-bench Verified for real-world coding tasks -- Claude 3.7 still remains the champion.[1]

Incredible how resilient the Claude models have been as best-in-class for coding.

[1] But by only about 1%, and inclusive of Claude's "custom scaffold" augmentation (which in practice I assume almost no one uses?). The new OpenAI models might still be effectively best in class now (or likely beating Claude with similar augmentation?).

jjani · 4 months ago
Gemini 2.5 Pro is widely considered superior to 3.7 Sonnet now by heavy users, but they don't have an SWE-bench score. This shows that looking at one such benchmark isn't very telling. Its main advantage over Sonnet is that it's better at using a large amount of context, which is enormously helpful during coding tasks.

Sonnet is still an incredibly impressive model as it held the crown for 6 months, which may as well be a decade with the current pace of LLM improvement.

unsupp0rted · 4 months ago
The main advantage over Sonnet is that Gemini 2.5 doesn't try to make a bunch of unrelated changes like it's rewriting my project from scratch.
armen52 · 4 months ago
I don't understand this assertion, but maybe I'm missing something?

Google included a SWE-bench score of 63.8% in their announcement for Gemini 2.5 Pro: https://blog.google/technology/google-deepmind/gemini-model-...

amedviediev · 4 months ago
I keep seeing this sentiment so often here and on X that I have to wonder if I'm somehow using a different Gemini 2.5 Pro. I've been trying to use it for a couple of weeks already and without exaggeration it has yet to solve a single programming task successfully. It is constantly wrong, constantly misunderstands my requests, ignores constraints, ignores existing coding conventions, breaks my code and then tells me to fix it myself.
spaceman_2020 · 4 months ago
I feel that Claude 3.7 is smarter, but does way too much and has poor prompt adherence
redox99 · 4 months ago
2.5 Pro is very buggy with Cursor. It often stops before generating any code. It's likely a Cursor problem, but I use 3.7 because of that.
saberience · 4 months ago
Eh, I wouldn't say that's accurate; I think it's situational. I code all day using AI tools and Sonnet 3.7 is still the king. Maybe it's language dependent or something, but all the engineers I know are full on Claude Code at this point.
pizzathyme · 4 months ago
The image generation improvement with o4-mini is incredible. Testing it out today, this is a step change in editing specificity even from the ChatGPT 4o LLM image integration just a few weeks ago (which was already a step change). I'm able to ask for surgical edits, and they are done correctly.

There isn't a numerical benchmark for this that people seem to be tracking but this opens up production-ready image use cases. This was worth a new release.

mchusma · 4 months ago
Thanks for sharing that. That was more interesting than their demo. I tried it and it was pretty good! I had felt that the inability to iterate on images blocked this from any real production use I had. This may be good enough now.

Example of edits (not quite surgical but good): https://chatgpt.com/share/68001b02-9b4c-8012-a339-73525b8246...

ilaksh · 4 months ago
Wait, o4-mini outputs images? What I thought I saw was the ability to do a tool call to zoom in on an image.

Are you sure that's not 4o?

Agentus · 4 months ago
Also, another addition: I previously tried to upload an image for ChatGPT to edit and it was incapable under the previous model I tried. Now it's able to change uploaded images using o4-mini.
oofbaroomf · 4 months ago
Claude got 63.2% according to the swebench.com leaderboard (listed as "Tools + Claude 3.7 Sonnet (2025-02-24)").[0] OpenAI said they got 69.1% in their blog post.

[0] swebench.com/#verified

georgewsinger · 4 months ago
Yes, however Claude advertised 70.3%[1] on SWE-bench Verified when using the following scaffolding:

> For Claude 3.7 Sonnet and Claude 3.5 Sonnet (new), we use a much simpler approach with minimal scaffolding, where the model decides which commands to run and files to edit in a single session. Our main “no extended thinking” pass@1 result simply equips the model with the two tools described here—a bash tool, and a file editing tool that operates via string replacements—as well as the “planning tool” mentioned above in our TAU-bench results.

Arguably this shouldn't be counted though?

[1] https://www.anthropic.com/_next/image?url=https%3A%2F%2Fwww-...

awestroke · 4 months ago
OpenAI have not shown themselves to be trustworthy, I'd take their claims with a few solar masses of salt
swyx · 4 months ago
they also gave more detail on their SWEBench scaffolding here https://www.latent.space/p/claude-sonnet
lattalayta · 4 months ago
I haven't been following them that closely, but are people finding these benchmarks relevant? It seems like these companies could just tune their models to do well on particular benchmarks.
mickael-kerjean · 4 months ago
The benchmark is something you can optimize for; it doesn't mean it generalizes well. Yesterday I tried for 2 hours to get Claude to create a program that would extract data from a weird Adobe file. $10 later, the best I had was a program that was doing something like:

  switch(testFile) {
    case "test1.ase": // run this because it's a particular case 
    case "test2.ase": // run this because it's a particular case
    default:  // run something that's not working but that's ok because the previous case should
              // give the right output for all the test files ...
  }

emp17344 · 4 months ago
That’s exactly what’s happening. I’m not convinced there’s any real progress occurring here.
knes · 4 months ago
Right now the SWE-bench leader, Augment Agent, still uses Claude 3.7 in combo with o1. https://www.augmentcode.com/blog/1-open-source-agent-on-swe-...

The findings are open-sourced in a repo too: https://github.com/augmentcode/augment-swebench-agent

thefourthchime · 4 months ago
Also, if you're using Cursor AI, it seems to have much better integration with Claude, where it can reflect on its own output and go off and run commands. I don't see it doing that with Gemini or the o1 models.
ksec · 4 months ago
I often wonder if we could expect that to reach 80-90% within the next 5 years.
osigurdson · 4 months ago
I have a very basic / stupid "Turing test", which is just to write a base 62 converter in C#. I would think this exact thing would be on GitHub somewhere (and thus in the weights), but it has always failed for me in the past (non-scientific / didn't try every single model).

Using o4-mini-high, it actually did produce a working implementation after a bit of prompting. So yeah, today, this test passed which is cool.

sebzim4500 · 4 months ago
Unless I'm misunderstanding what you are asking the model to do, Gemini 2.5 pro just passed this easily. https://g.co/gemini/share/e2876d310914
osigurdson · 4 months ago
As I mentioned, this is not a scientific test but rather just something that I have tried from time to time and that has always (shockingly, in my opinion) failed but today worked. It takes a minute or two of prompting, is boring to verify, and I don't remember exactly which models I have used. It is purely a personal anecdote, nothing more.

However, looking at the code that Gemini wrote in the link, it does the same thing that other LLMs often do, which is to assume that we are encoding individual long values. I assume there must be a GitHub repo or Stack Overflow question in the weights somewhere that is pushing it in this direction, but it is a little odd. Naturally, this isn't the kind of encoder that someone would normally want. Typically it should encode a byte array and return a string (or maybe encode / decode UTF-8 strings directly). Having the interface use a long is very weird and not very useful.

In any case, I suspect with a bit more prompting you might be able to get gemini to do the right thing.
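For concreteness, here's a minimal sketch of the kind of interface I mean (take a byte array, return a string), in C# since that's what the test asks for. The class and method names are just placeholders, and it leans on System.Numerics.BigInteger rather than anything clever:

  using System;
  using System.Numerics;
  using System.Text;

  static class Base62
  {
      const string Alphabet =
          "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

      // Encode an arbitrary byte array by treating it as one big unsigned integer.
      public static string Encode(byte[] data)
      {
          if (data == null || data.Length == 0) return string.Empty;

          // Interpret the bytes as an unsigned, big-endian integer (.NET Core 2.1+ overload).
          var value = new BigInteger(data, isUnsigned: true, isBigEndian: true);

          var sb = new StringBuilder();
          while (value > 0)
          {
              value = BigInteger.DivRem(value, 62, out BigInteger rem);
              sb.Insert(0, Alphabet[(int)rem]);
          }

          // Keep leading zero bytes, which the integer view would otherwise drop.
          foreach (var b in data)
          {
              if (b != 0) break;
              sb.Insert(0, Alphabet[0]);
          }

          return sb.ToString();
      }
  }

Decoding is the same loop in reverse; the interesting part of the test is whether a model reaches for the byte-array interface at all rather than the long-based one.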

AaronAPU · 4 months ago
I’ve been using Gemini 2.5 pro side by side with o1-pro and Grok lately. My experience is they each randomly offer significant insight the other two didn’t.

But generally, o1-pro listens to my profile instructions WAY better, and it seems to be better at actually solving problems the first time. More reliable.

But they are all quite similar and so far these new models are similar but faster IMO.

croemer · 4 months ago
I asked o3 to build and test a maximum parsimony phylogenetic tree builder in Python (my standard test for new models) and it's been thinking for 10 minutes. It's still not clear if anything is happening; I have barely seen any code since I asked it to test what it produced in the first answer. The thought summary is totally useless compared to Gemini's. Underwhelming so far.

The CoT summary is full of references to Jupyter notebook cells. The variable names are too abbreviated (nbr for neighbor), so the code becomes fairly cryptic as a result; not nice to read. Maybe optimized too much for speed.

Also, I've noticed ChatGPT seems to abort thinking when I switch away from the app. That's stupid; I don't want to look at a spinner for 5 minutes.

And the CoT summary keeps mentioning my name which is irritating.

istjohn · 4 months ago
It's maddening that you can't switch away from the app while it generates output. To use the Deep Research feature on mobile, you have to give up your phone for ten minutes.
beefnugs · 4 months ago
Have you tried cutting the job up into a series of smaller verifiable intermediate steps?
NiloCK · 4 months ago
I could be misinterpreting your claim here, but I'll point out that LLM weights don't literally encode the entirety of the training data set.
glial · 4 months ago
I guess you could consider it a lossy encoding.
jcynix · 4 months ago
To plan a visit to a dark sky place, I used duck.ai (DuckDuckGo's experimental AI chat feature) to ask five different AIs on what date the new moon will happen in August 2025.

GPT-4o mini: The new moon in August 2025 will occur on August 12.

Llama 3.3 70B: The new moon in August 2025 is expected to occur on August 16, 2025.

Claude 3 Haiku: The new moon in August 2025 will occur on August 23, 2025.

o3-mini: Based on astronomical calculations, the new moon in August 2025 is expected to occur on August 7, 2025 (UTC). [...]

Mistral Small 3: To determine the date of the new moon in August 2025, we can use astronomical data or a reliable astronomical calendar. As of my last update in October 2023, I don't have real-time data access, but I can guide you on how to find this information. [...]

I got different answers, mostly wrong. My calendars (both paper and app versions) show August 23 as the date.

And btw, when I asked those AIs which entries in a robots.txt file would block most Chinese search engines, one of them (Claude) told me that it can't tell because that might be discriminatory: "I apologize, but I do not feel comfortable providing recommendations about how to block specific search engines in a robots.txt file. That could be seen as attempting to circumvent or manipulate search engine policies, which goes against my principles."
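For reference, what I was after is just a few stanzas along these lines (a rough sketch: Baiduspider, "Sogou web spider", and 360Spider are the commonly documented crawler tokens for Baidu, Sogou, and Qihoo 360, so verify them against each engine's current documentation, and remember robots.txt is only advisory):

  User-agent: Baiduspider
  Disallow: /

  User-agent: Sogou web spider
  Disallow: /

  User-agent: 360Spider
  Disallow: /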

WhatIsDukkha · 4 months ago
I would never ask any of these questions of an LLM (and I use and rely on LLMs multiple times a day); this is a job for a computer.

I would also never ask a coworker for this precise number either.
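As a rough illustration of why I call it a job for a computer: a date-level answer falls out of simple arithmetic against a reference new moon and the mean synodic month. The sketch below assumes the commonly quoted reference new moon of 2000-01-06 18:14 UTC (an assumption worth checking); true new moons drift hours away from the mean, so treat the output as a date estimate, not an exact time.

  using System;

  class NewMoonEstimate
  {
      // Mean length of a synodic month (new moon to new moon), in days.
      const double SynodicMonth = 29.530588;

      // Commonly quoted reference new moon (an assumption): 2000-01-06 18:14 UTC.
      static readonly DateTime Epoch = new DateTime(2000, 1, 6, 18, 14, 0, DateTimeKind.Utc);

      static void Main()
      {
          // First estimated new moon on or after 2025-08-01 UTC.
          var from = new DateTime(2025, 8, 1, 0, 0, 0, DateTimeKind.Utc);
          double lunations = (from - Epoch).TotalDays / SynodicMonth;
          DateTime estimate = Epoch.AddDays(Math.Ceiling(lunations) * SynodicMonth);

          Console.WriteLine($"Estimated new moon: {estimate:yyyy-MM-dd HH:mm} UTC");
      }
  }

This lands on August 23rd UTC, agreeing with the paper calendars mentioned above; pinning down the exact instant (23:06 PDT on the 22nd, per the replies below) needs a proper ephemeris rather than the mean month.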

achierius · 4 months ago
But it's a good reminder when so many enterprises like to claim that hallucinations have "mostly been solved".
jcynix · 4 months ago
My reasoning for the plain question was: as people start to replace search engines with AI chat, I thought that asking "plain" questions to see how trustworthy the answers might be would be a good test. Because plain folks will ask plain questions and won't think about the subtle details. They would not expect a "precise number" either, i.e. not 23:06 PDT, but would like to know whether this weekend would be fine for a trip or whether the previous or next weekend would be better to book a "dark sky" tour.

And, BTW, I thought that LLMs are computers too ;-0

stavros · 4 months ago
First we wanted to be able to do calculations really quickly, so we built computers.

Then we wanted the computers to reason like humans, so we built LLMs.

Now we want the LLMs to do calculations really quickly.

It doesn't seem like we'll ever be satisfied.

ec109685 · 4 months ago
These models are proclaimed to be near AGI, so they should be smarter than to hallucinate an answer.
pixl97 · 4 months ago
So I asked GPT-o4-mini-high

"On what date will the new moon occur on in August 2025. Use a tool to verify the date if needed"

It correctly reasoned it did not have exact dates due to its cutoff and did a lookup.

"The new moon in August 2025 falls on Friday, August 22, 2025"

Now, I did not specify the timezone I was in, so our difference between the 22nd and the 23rd appears to be just a time zone difference, as it had marked a time of 23:06 PDT per its source.
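(Worked out: PDT is UTC-7, so 23:06 PDT on August 22 plus 7 hours is 06:06 UTC on August 23, which is how both the 22nd and the 23rd can be "correct" depending on the time zone.)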

phoe18 · 4 months ago
Response from Gemini 2.5 Pro for comparison -

``` Based on the search results, the new moon in August 2025 will occur late on Friday, August 22nd, 2025 in the Pacific Time Zone (PDT), specifically around 11:06 PM.

In other time zones, like the Eastern Time Zone (ET), this event falls early on Saturday, August 23rd, 2025 (around 2:06 AM). ```

jcynix · 4 months ago
"Use a tool to verify the date if needed" that's a good idea, yes. And the answers I got are based on UTC, so 23:06 PDT should match the 23. for Europe.

My reasoning for the plain question was: as people start to replace search engines by AI chat, I thought that asking "plain" questions to see how trustworthy the answers might be would be worth it.

ec109685 · 4 months ago
Even with a knowledge cutoff, you could know when a future new moon would be.
andrewinardeer · 4 months ago
"Who was the President of the United States when Neil Armstrong walked on the moon?"

Gemini 2.5 refuses to answer this because it is too political.

staticman2 · 4 months ago
Gemini 2.5 is not generating that refusal. It's a separate censorship model.

It's clearer when you try via AI Studio, which has censorship level toggles.

throwaway314155 · 4 months ago
> one of them (Claude) told me that it can't tell because that might be discriminatory: "I apologize, but I do not feel comfortable providing recommendations about how to block specific search engines in a robots.txt file. That could be seen as attempting to circumvent or manipulate search engine policies, which goes against my principles."

How exactly does that response have anything to do with discrimination?

xnx · 4 months ago
Gemini gets the new moon right. Better to use one good model than 5 worse ones.
kenjackson · 4 months ago
I think all the full power LLMs will get it right because they do web search. ChatGPT 4 does as well.
andrethegiant · 4 months ago
Buried in the article, a new CLI for coding:

> Codex CLI is fully open-source at https://github.com/openai/codex today.

dang · 4 months ago
Related ongoing thread:

OpenAI Codex CLI: Lightweight coding agent that runs in your terminal - https://news.ycombinator.com/item?id=43708025

ipsum2 · 4 months ago
Looks like a Claude Code clone.
jumpCastle · 4 months ago
But open source, like aider.
zapnuk · 4 months ago
Surprisingly, they didn't provide a comparison to Sonnet 3.7 or Gemini Pro 2.5—probably because, while both are impressive, they're only slightly better by comparison.

Let's see what the pricing looks like.

Workaccount2 · 4 months ago
Looks like they are taking a page from Apple's book, which is to never even acknowledge other products exist outside your ecosystem.
stogot · 4 months ago
Apple ran commercials for a decade making fun of "PCs".
oofbaroomf · 4 months ago
They didn't provide a comparison either in the GPT-4.1 release and quite a few past releases, which is telling of their attitude as an org.
BeetleB · 4 months ago
Pricing is already available:

https://platform.openai.com/docs/pricing

carlita_express · 4 months ago
> we’ve observed that large-scale reinforcement learning exhibits the same “more compute = better performance” trend observed in GPT‑series pretraining.

Didn’t the pivot to RL from pretraining happen because the scaling “law” didn’t deliver the expected gains? (Or at least because O(log) increases in model performance became unreasonably costly?) I see they’ve finally resigned themselves to calling these trends, not laws, but trends are often fleeting. Why should we expect this one to hold for much longer?

anothermathbozo · 4 months ago
This isn't exactly the case. The trend is on a log scale, so a 10x in pretraining should yield a 10% increase in performance. That's not proving to be false per se; rather, they are encountering practical limitations around 10x'ing data volume and 10x'ing available compute.
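To spell out what that log-linear trend implies (the 10%-per-10x figure is illustrative, not an official scaling law): if performance is roughly linear in the log of compute, every 10x of compute buys the same fixed increment at ten times the cost, i.e.

  P(C) \approx \alpha + \beta \log_{10} C
  \quad\Rightarrow\quad
  P(10C) - P(C) = \beta \text{ for every } C

so the practical limitation is constant gains against exponentially growing inputs.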
carlita_express · 4 months ago
I am aware of that, like I said:

> (Or at least because O(log) increases in model performance became unreasonably costly?)

But, yes, I left implicit in my comment that the trend might be “fleeting” because of its impracticality. RL is only a trend so long as it is fashionable, and only fashionable (i.e., practical) so long as OpenAI is fed an exponential amount of VC money to ensure linear improvements under O(log) conditions.

OpenAI is selling to VCs the idea that some hitherto unspecified amount of linear model improvement will kick off productivity gains greater than their exponentially increasing investment. These productivity gains would be no less than a sizeable percentage of American GDP, which Altman has publicly set as his target. But as the capital required increases exponentially, the gap between linearly increasing model capability (i.e., its productivity) and the breakeven ROI target widens. The bigger model would need to deliver a non-linear increase in productivity to justify the exponential price tag.

og_kalu · 4 months ago
It doesn't need to hold forever or even 'much longer', depending on your definition of that duration. It just needs to hold long enough to realize certain capabilities.

Will it? Who knows. But seeing as this is something you can't predict ahead of time, it makes little sense not to try, insofar as the whole thing is still feasible.