I've been extremely impressed (and actually had quite a good time) with GPT-5 and Codex so far. It seems to handle long context well, does a great job researching the code, never leaves things half-done (with long tasks it may leave some steps for later, but it never does 50% of a step and then just randomly mocks a function like Gemini used to), and gives me good suggestions if I'm trying to do something I shouldn't. And the Codex CLI also seems to be getting constant, meaningful updates.
Agreed. We're hardcore Claude Code users and my CC usage trended down to zero pretty quickly after I started using Codex. The new model updates today are great. Very well done OpenAI team!! CC was an existential threat. You responded and absolutely killed it. Your move Anthropic.
To be fair, Anthropic kinda did this to themselves. I consider it a pretty massive throw on their end, given the fairly tight grasp they had on developer sentiment.
Everyone else slowly caught up and/or surpassed them while they simultaneously had quality control issues and service degradation plaguing their system - ALL while having the most expensive models relative to the intelligence they deliver.
I would sincerely like to understand what steps got you to zero usage of CC. I've seen enough hits and misses with Codex to feel like it tries really hard to be good, and in some ways it is (the out-of-the-box context management seems like a pretty smooth, batteries-included feature), but in some ways that matter to me it keeps falling on its face, like giving up on what it deems too complex a task (in my case, porting a pretty robust but very slow JS deobfuscation tool over to Rust). That has kept me from feeling much confidence or speculative joy about it so far. It caught and fixed some bugs after a few turns of renewing context, but I was already doing that with CC (with better walkthroughs as it worked), so it felt underwhelming. As anecdotal as my experience is, I still feel like with every "new"-ish thing thrown at us in AI tooling, the hype does not live up to the reality, FOR ME.
This just goes to show how crucial it was for Anthropic and OpenAI to hire first class product leads. You can’t just pay the AI engineers $100M. Models alone don’t generate revenue.
- The smartest model I have used. Solves problems better than Opus-4.1.
- It can be lazy. With Claude Code / Opus, once given a problem, it will generally work until completion. Codex will often perform only the first few steps and then ask if I want to continue to do the rest. It does this even if I tell it to not stop until completion.
- I have seen severe degradation near max context. For example, I have seen it just repeat the next steps every time I tell it to continue and I have to manually compact.
I'm not sure if the problems are GPT-5 or Codex. I suspect a better Codex could resolve them.
Claude seems to have gotten worse for me, with both that kind of laziness and a new pattern where it will write the test, write the code, run the test, and then declare that the test is working perfectly but there are problems in the (new) code that need to be fixed.
Context degradation is a real problem with all frontier LLMs. As a rule of thumb I try to never exceed 50% of available context window when working with either Claude Sonnet 4 or GPT-5 since the quality drops really fast from there.
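To make that rule of thumb concrete, here's a rough sketch of the kind of check I mean; this is purely my own heuristic (the ~200k window and the ~4 characters per token estimate are assumptions, not numbers read from either tool):

    // Hedged sketch: warn once a conversation passes ~50% of the context window.
    // Both CONTEXT_WINDOW_TOKENS and the chars/4 estimate are rough assumptions.
    const CONTEXT_WINDOW_TOKENS = 200_000;

    function estimateTokens(text: string): number {
      return Math.ceil(text.length / 4); // crude chars-to-tokens approximation
    }

    function shouldCompact(messages: string[], budgetFraction = 0.5): boolean {
      const used = messages.reduce((sum, m) => sum + estimateTokens(m), 0);
      return used > CONTEXT_WINDOW_TOKENS * budgetFraction;
    }

    // e.g. if (shouldCompact(history)) console.warn("past 50% of context, consider compacting");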
Yes, this is the one thing stopping me from going to Codex completely.
Currently, it's kind of annoying that Codex stops often and asks me what to do, and I just reply "continue", even though I already gave it a checklist.
With GPT‑5-Codex they do write: "During testing, we've seen GPT‑5-Codex work independently for more than 7 hours at a time on large, complex tasks, iterating on its implementation, fixing test failures, and ultimately delivering a successful implementation."
https://openai.com/index/introducing-upgrades-to-codex/
I definitely agree with all of those points. I just really prefer it completing steps and asking me if we should continue to the next step, rather than doing half a step and telling me it's done. And the context degradation seems quite random - sometimes it hits way earlier, sometimes we go through a crazy amount of tokens and it all works out.
I also noticed the laziness compared to Sonnet models but now I feel it’s a good feature. Sonnet models, now I realize, are way too eager to hammer out code with way more likelihood of bugs.
Can someone compare it to Cursor? So far I see people compare it with Claude Code, but I've had much more success and cost effectiveness with Cursor than Claude Code.
Doesn’t compare, because Cursor has a privacy mode. Why would anyone want to pay OpenAI or Anthropic to train their bots on your business codebase? You know where that leads? Unemployment!
It doesn't seem to have any internal tools it can use. For example, web search: it just runs curl in the terminal. Compared to Gemini CLI that's rough, but it does handle pasting much better... Maybe I'm just using both wrong...
It does have web search - it's just not enabled by default. You can enable it with --search or in the config, then it can absolutely search, for example finding manuals/algorithms.
This should probably be merged with the other GPT-5-Codex thread at https://news.ycombinator.com/item?id=45252301 since nobody in this thread is talking about the system card addendum.
My problem with all of the Codex/GPT-based offerings is _still_ that they think for way too long. After using Claude 4 models through Cursor Max / Ampcode I feel much more effective given their speed. Ironically, Claude Code feels just as slow as Codex/GPT (even with my company patching through AWS Bedrock). It only makes me feel more that the consumer modes have perverse incentives.
I almost never have to reprompt GPT-5-high (now gpt-5-codex-high), whereas I would be reprompting Claude Code all the time. Claude feels like it's faster and doing more, but it ends up taking more of the developer's time by getting things wrong.
It’s great for multitasking. I’ve cloned one of the repos I work on into a new folder and use Codex CLI in there. I feed it bug reports that users have submitted, while I work on bigger tasks.
Interesting, the new model uses a different prompt in Codex CLI that's ~half the size (10KB vs. 23KB) of the previous prompt[0][1].
SWE-bench performance is similar to normal gpt-5, so it seems the main delta with `gpt-5-codex` is on code refactors (via internal refactor benchmark 33.9% -> 51.3%).
As someone who recently used Codex CLI (`gpt-5-high`) to do a relatively large refactor (multiple internal libs to dedicated packages), I kept running into bugs introduced when the model would delete a file and then rewrite it (missing crucial details). My approach would have been to just copy the file over and then make package-specific changes, so maybe better tool calling is at play here.
Additionally, they claim the new model is more steerable (both with AGENTS.md and generally).
In my experience, Codex CLI w/gpt-5 is already a lot more steerable than Claude Code, but any improvements are welcome!
[0] https://github.com/openai/codex/blob/main/codex-rs/core/gpt_...
[1] https://github.com/openai/codex/blob/main/codex-rs/core/prom...
What worked was getting it to first write a detailed implementation plan for a “junior contractor”, then attempt it in phases (clearing the task window each time), with instructions to copy files to /tmp, transform them there, and then update the originals.
Looking forward to trying the new model out on the next refactor!
Yes, regardless of tool, I always create a separate plan doc for larger changes
Will try adding instructions specific to refactors (i.e. copy/move files rather than rewriting them, when possible)
I've also found it helpful, especially for certain regressions, to basically create a new branch for any Codex/CC assisted task (even if part of a larger task). Makes it easier to identify regressions due to recent changes (i.e. look at git diff, it worked previously)
Telling the "agent" to manage git leads to more context pollution than I want, so I manage all commits/branches myself, but I'm sure that will change as the tools improve/they do more RL on full-cycle software dev
It would be nice if this model were good enough to update their TypeScript SDK (+ agents library) to use, or at least support, zod v4 - they still use v3.
Had to spend quite a long time to figure out a dependency error...
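For anyone hitting the same conflict, here's a sketch of the kind of workaround I'd look at first (assuming zod >= 3.25, which ships the v4 API under the "zod/v4" subpath; the SDK's actual peer-dependency ranges may differ, so check your lockfile):

    // Hedged sketch: keep zod@3.x installed so the SDK's dependency on v3
    // still resolves, while your own code opts into the v4 API via the
    // subpath export that zod 3.25+ provides.
    import { z } from "zod/v4"; // your code uses the v4 API
    // the SDK keeps importing plain "zod", which resolves to the v3 API

    const User = z.object({
      name: z.string(),
      age: z.number().int(),
    });

    type User = z.infer<typeof User>;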
I've had great results with Codex, though I found GPT-5 via ChatGPT was giving much better results than the existing model, so I ended up using that directly instead. Very excited to have the model upgraded in Codex itself.
The main issues with Codex now seem to be the very poor stability (it seems to be down almost 50% of the time) and lack of custom containers. Hoping those get solved soon, particularly the stability.
I also wonder where the price will end up, it currently seems unsustainably cheap.
Can anyone share their thoughts on Claude Code vs Codex?
I've just started out trying out Claude Code and am not sure how Codex compares on React projects.
From my initial usage, it seems Claude Code's planning mode is superior to its normal(?) mode, and giving it an overall direction to proceed, rather than just stating a desired feature, seems to produce better results. It also does better if a large task is split into very small sub-tasks.
I've used Claude Code for about 3 months now. I was a big fan until recent changes lobotomized it, so I switched over to Codex about 2 weeks ago and I'm loving it so far - way better experience. Today, with the introduction of the new model, I've been refactoring an old Claude Code project all day and so far things are looking good. I am very impressed; OpenAI cooked hard here...
It's super annoying that it doesn't provide a way to approve edits one by one; instead it either vibe-codes on its own or gives me diffs to copy-paste.
Claude code has a much saner "normal mode".
> Codex will often perform only the first few steps and then ask if I want to continue to do the rest. It does this even if I tell it to not stop until completion.
Very frustrating, and happening more often.
I'm not sure the fault is that it writes bad code; I guess it's just not good at being agentic. I saw this with Gemini CLI and other tools too.
GLM, Kimi, and Qwen-Code all behave better for me.
Gemini 3 will probably fix this, as Gemini 2.5 Pro is "old" by now.
Gemini CLI is too inconsistent; it's good for documentation tasks, but don't let it write code for you.
Claude Code does that on longer tasks.
Time to give Codex a try I guess.
But when I installed Codex and tried to make a simple code bugfix, I got rate limited nearly immediately. As in, after 3 "steps" the agent took.
Are you meant to only use Codex with their $200 "unlimited" plans? Thanks!
> I also wonder where the price will end up, it currently seems unsustainably cheap.
JetBrains has a $30/mo subscription (with a GPT-5 backend) and the quota burns fast.
Assuming JetBrains prices at breakeven, either OpenAI has some secret sauce or they're losing money on Codex.