I've been extremely impressed (and actually had quite a good time) with GPT-5 and Codex so far. It seems to handle long context well, does a great job researching the code, never leaves things half-done (with long tasks it may leave some steps for later, but it never does 50% of a step and then just randomly mocks a function like Gemini used to), and gives me good suggestions if I'm trying to do something I shouldn't. And the Codex CLI also seems to be getting constant, meaningful updates.
Agreed. We're hardcore Claude Code users and my CC usage trended down to zero pretty quickly after I started using Codex. The new model updates today are great. Very well done OpenAI team!! CC was an existential threat. You responded and absolutely killed it. Your move Anthropic.
To be fair, Anthropic kinda did this to themselves. I consider it a pretty massive throw on their end, given the fairly tight grasp they had on developer sentiment.
Everyone else slowly caught up and/or surpassed them while they simultaneously had quality control issues and service degradation plaguing their system - ALL while having the most expensive models relative to the intelligence they deliver.
I would sincerely like to understand what steps got you to zero usage of CC. I've seen enough hits and misses with Codex to feel like it tries really hard to be good, and in some ways it is (the out-of-the-box context management seems like a pretty smooth, batteries-included feature), but in some ways that matter to me it keeps falling on its face, like giving up on what it deems too complex a task (in my case, porting a pretty robust but very slow JS deobfuscation tool over to Rust). That has kept me from feeling much confidence or speculative joy about it so far. It caught and fixed some bugs after a few turns of renewing context, but I was already doing that with CC (with better walkthroughs as it worked), so it felt underwhelming. As anecdotal as my experience is, I still feel like with every "new"-ish thing thrown at us in AI tooling, the hype does not live up to the reality, FOR ME.
This just goes to show how crucial it was for Anthropic and OpenAI to hire first class product leads. You can’t just pay the AI engineers $100M. Models alone don’t generate revenue.
- The smartest model I have used. Solves problems better than Opus-4.1.
- It can be lazy. With Claude Code / Opus, once given a problem, it will generally work until completion. Codex will often perform only the first few steps and then ask if I want to continue to do the rest. It does this even if I tell it to not stop until completion.
- I have seen severe degradation near max context. For example, I have seen it just repeat the next steps every time I tell it to continue and I have to manually compact.
I'm not sure if the problems are GPT-5 or Codex. I suspect a better Codex could resolve them.
Claude seems to have gotten worse for me, with both that kind of laziness and a new pattern where it will write the test, write the code, run the test, and then declare that the test is working perfectly but there are problems in the (new) code that need to be fixed.
Context degradation is a real problem with all frontier LLMs. As a rule of thumb I try to never exceed 50% of available context window when working with either Claude Sonnet 4 or GPT-5 since the quality drops really fast from there.
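To make that rule of thumb concrete, here's a rough sketch of the kind of check I mean; this is purely my own heuristic (the ~200k window and the ~4 characters per token estimate are assumptions, not numbers read from either tool):

    // Hedged sketch: warn once a conversation passes ~50% of the context window.
    // Both CONTEXT_WINDOW_TOKENS and the chars/4 estimate are rough assumptions.
    const CONTEXT_WINDOW_TOKENS = 200_000;

    function estimateTokens(text: string): number {
      return Math.ceil(text.length / 4); // crude chars-to-tokens approximation
    }

    function shouldCompact(messages: string[], budgetFraction = 0.5): boolean {
      const used = messages.reduce((sum, m) => sum + estimateTokens(m), 0);
      return used > CONTEXT_WINDOW_TOKENS * budgetFraction;
    }

    // e.g. if (shouldCompact(history)) console.warn("past 50% of context, consider compacting");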
Yes, this is the one thing stopping me from going to Codex completely.
Currently, it's kind of annoying that Codex stops often and asks me what to do, and I just reply "continue", even though I already gave it a checklist.
With GPT‑5-Codex they do write: "During testing, we've seen GPT‑5-Codex work independently for more than 7 hours at a time on large, complex tasks, iterating on its implementation, fixing test failures, and ultimately delivering a successful implementation."
https://openai.com/index/introducing-upgrades-to-codex/
I definitely agree with all of those points. I just really prefer it completing steps and asking me if we should continue to the next step, rather than doing half a step and telling me it's done. And the context degradation seems quite random - sometimes it hits way earlier, sometimes we go through a crazy amount of tokens and it all works out.
I also noticed the laziness compared to Sonnet models but now I feel it’s a good feature. Sonnet models, now I realize, are way too eager to hammer out code with way more likelihood of bugs.
Can someone compare it to Cursor? So far I see people compare it with Claude Code, but I've had much more success and cost effectiveness with Cursor than Claude Code.
Doesn’t compare, because Cursor has a privacy mode. Why would anyone want to pay OpenAI or Anthropic to train their bots on your business codebase? You know where that leads? Unemployment!
It doesn't seem to have any internal tools it can use. For example, web search: it just runs curl in the terminal. Compared to Gemini CLI that's rough, but it does handle pasting much better... Maybe I'm just using both wrong...
It does have web search - it's just not enabled by default. You can enable it with --search or in the config, then it can absolutely search, for example finding manuals/algorithms.
This should probably be merged with the other GPT-5-Codex thread at https://news.ycombinator.com/item?id=45252301 since nobody in this thread is talking about the system card addendum.
My problem with all of the Codex/GPT-based offerings is _still_ that they think for way too long. After using Claude 4 models through Cursor Max / Ampcode I feel much more effective given their speed. Ironically, Claude Code feels just as slow as Codex/GPT (even with my company patching through AWS Bedrock). It only makes me feel more that the consumer modes have perverse incentives.
I almost never have to reprompt GPT-5-high (now gpt-5-codex-high), whereas I would be reprompting Claude Code all the time. Claude feels like it's faster and doing more, but it ends up taking more of the developer's time by getting things wrong.
It’s great for multitasking. I’ve cloned one of the repos I work on into a new folder and use Codex CLI in there. I feed it bug reports that users have submitted, while I work on bigger tasks.
Interesting, the new model uses a different prompt in Codex CLI that's ~half the size (10KB vs. 23KB) of the previous prompt[0][1].
SWE-bench performance is similar to normal gpt-5, so it seems the main delta with `gpt-5-codex` is on code refactors (via internal refactor benchmark 33.9% -> 51.3%).
As someone who recently used Codex CLI (`gpt-5-high`) to do a relatively large refactor (multiple internal libs to dedicated packages), I kept running into bugs introduced when the model would delete a file and then rewrite it (missing crucial details). My approach would have been to just copy the file over and then make package-specific changes, so maybe better tool calling is at play here.
Additionally, they claim the new model is more steerable (both with AGENTS.md and generally).
In my experience, Codex CLI w/gpt-5 is already a lot more steerable than Claude Code, but any improvements are welcome!
[0] https://github.com/openai/codex/blob/main/codex-rs/core/gpt_...
[1] https://github.com/openai/codex/blob/main/codex-rs/core/prom...
What worked was getting it to first write a detailed implementation plan for a “junior contractor”, then attempt it in phases (clearing the task window each time), with instructions to copy files to /tmp, transform them there, and then update the originals.
Looking forward to trying the new model out on the next refactor!
Yes, regardless of tool, I always create a separate plan doc for larger changes
Will try adding instructions specific to refactors (i.e. copy/move files rather than rewriting them, when possible)
I've also found it helpful, especially for certain regressions, to basically create a new branch for any Codex/CC assisted task (even if part of a larger task). Makes it easier to identify regressions due to recent changes (i.e. look at git diff, it worked previously)
Telling the "agent" to manage git leads to more context pollution than I want, so I manage all commits/branches myself, but I'm sure that will change as the tools improve/they do more RL on full-cycle software dev
It would be nice if this model were good enough to update their TypeScript SDK (+ agents library) to use, or at least support, zod v4 - they still use v3.
Had to spend quite a long time to figure out a dependency error...
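For anyone hitting the same conflict, here's a sketch of the kind of workaround I'd look at first (assuming zod >= 3.25, which ships the v4 API under the "zod/v4" subpath; the SDK's actual peer-dependency ranges may differ, so check your lockfile):

    // Hedged sketch: keep zod@3.x installed so the SDK's dependency on v3
    // still resolves, while your own code opts into the v4 API via the
    // subpath export that zod 3.25+ provides.
    import { z } from "zod/v4"; // your code uses the v4 API
    // the SDK keeps importing plain "zod", which resolves to the v3 API

    const User = z.object({
      name: z.string(),
      age: z.number().int(),
    });

    type User = z.infer<typeof User>;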
I've had great results with Codex, though I found GPT-5 via ChatGPT was giving much better results than the existing model, so I ended up using that directly instead. Very excited to have the model upgraded in Codex itself.
The main issues with Codex now seem to be the very poor stability (it seems to be down almost 50% of the time) and lack of custom containers. Hoping those get solved soon, particularly the stability.
I also wonder where the price will end up, it currently seems unsustainably cheap.
Can anyone share their thoughts on Claude Code vs Codex?
I've just started out trying out Claude Code and am not sure how Codex compares on React projects.
From my initial usage, it seems Claude Code's planning mode is superior to its normal(?) mode, and giving it an overall direction to proceed, rather than just stating a desired feature, seems to produce better results. It also does better if a large task is split into very small sub-tasks.
I've used Claude Code for about 3 months now. I was a big fan until recent changes lobotomized it, so I switched over to Codex about 2 weeks ago and I'm loving it so far - way better experience. Today, with the introduction of the new model, I've been refactoring an old Claude Code project all day and so far things are looking good. I am very impressed; OpenAI cooked hard here...
It's super annoying that it doesn't provide a way to approve edits one by one; instead it either vibe-codes on its own or gives me diffs to copy-paste.
Claude code has a much saner "normal mode".
> Codex will often perform only the first few steps and then ask if I want to continue to do the rest. It does this even if I tell it to not stop until completion.
Very frustrating, and happening more often.
I'm not sure the fault is that it writes bad code; I guess it's just not good at being agentic. I saw this with Gemini CLI and other tools too.
GLM, Kimi, and Qwen-Code all behave better for me.
Gemini 3 will probably fix this, as Gemini 2.5 Pro is "old" by now.
Gemini CLI is too inconsistent; it's good for documentation tasks, but don't let it write code for you.
Claude Code does that on longer tasks.
Time to give Codex a try I guess.
But when I installed Codex and tried to make a simple code bugfix, I got rate limited nearly immediately. As in, after 3 "steps" the agent took.
Are you meant to only use Codex with their $200 "unlimited" plans? Thanks!
> I also wonder where the price will end up, it currently seems unsustainably cheap.
JetBrains has a $30/mo subscription (with a GPT-5 backend) and the quota burns fast.
Assuming JetBrains prices at breakeven, either OpenAI has some secret sauce or they're losing money on Codex.