Interesting, the new model's prompt is ~half the size (10KB vs. 23KB) of the previous prompt[0][1].
SWE-bench performance is similar to normal gpt-5, so it seems the main delta with `gpt-5-codex` is on code refactors (per their internal refactor benchmark: 33.9% -> 51.3%).
As someone who recently used Codex CLI (`gpt-5-high`) to do a relatively large refactor (multiple internal libs to dedicated packages), I kept running into bugs introduced when the model would delete a file and then rewrite it from scratch (missing crucial details). My approach would have been to just copy the file over and then make package-specific changes, so maybe better tool calling is at play here.
Additionally, they claim the new model is more steerable (both with AGENTS.md and generally). In my experience, Codex CLI w/ gpt-5 is already a lot more steerable than Claude Code, but any improvements are welcome!

[0] https://github.com/openai/codex/blob/main/codex-rs/core/gpt_...
[1] https://github.com/openai/codex/blob/main/codex-rs/core/prom...
Interestingly, "more steerable" can sometimes be a bad thing, as it will tend to follow your prompt to the letter even if that's against your interests. It requires better prompting and generally knowing what you're doing - might be worse for vibe-coders and better for experienced SWEs.
Not suddenly, it's been better since GPT-5 launched.
Prompting is different, but in a good way.
With Claude Code, you can use less prompting, and Claude will get token-happy and expand on your request. Great for greenfield/vibing, bad for iterating on existing projects.
With Codex CLI, GPT-5 seems to handle instructions much more precisely. It won't just go off on its own and do a bunch of work; it will do what you ask.
I've found that being more specific up-front gets better results with GPT-5, whereas with Claude, being more specific doesn't necessarily rein in the eagerness of its output.
As with all LLMs, mileage varies by language and codebase, so to clarify: my experience is primarily with TypeScript and Rust codebases.
Small suggestion on refactors into packages: move the files manually, then just tell Codex "they used to be in different locations, fix it up so it builds".
It seems the concept of file moving isn't something Codex (and the other CLIs) handles well yet. (Same goes for removing; I've ~never seen it successfully track moves and removes in the git commit if I ask for one.)
Does refactoring mean moving things around for people? Why not use your IDE for this? It already handles fixing imports (or use find-replace), and it's faster and deterministic.
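If you'd rather script it than click through an IDE, the deterministic version is tiny. A rough Node sketch (every name here, the directories and the `@acme/...` specifiers, is made up for illustration):

    // move-lib.ts: hypothetical one-off script that moves an internal lib into a
    // dedicated package and rewrites its import specifier everywhere.
    import * as fs from "node:fs";
    import * as path from "node:path";

    const OLD_DIR = "libs/auth";            // where the code lives today (made up)
    const NEW_DIR = "packages/auth/src";    // where it should end up (made up)
    const OLD_SPEC = "@acme/internal/auth"; // old import specifier (made up)
    const NEW_SPEC = "@acme/auth";          // new import specifier (made up)

    // 1. Move the directory; a rename is deterministic, and git detects it as a rename on commit.
    fs.mkdirSync(path.dirname(NEW_DIR), { recursive: true });
    fs.renameSync(OLD_DIR, NEW_DIR);

    // 2. Rewrite the import specifier in every TypeScript source file.
    function walk(dir: string): string[] {
      return fs.readdirSync(dir, { withFileTypes: true }).flatMap((entry) => {
        const full = path.join(dir, entry.name);
        if (entry.isDirectory()) {
          return entry.name === "node_modules" || entry.name === ".git" ? [] : walk(full);
        }
        return /\.(ts|tsx)$/.test(entry.name) ? [full] : [];
      });
    }

    for (const file of walk(".")) {
      const src = fs.readFileSync(file, "utf8");
      if (src.includes(OLD_SPEC)) {
        fs.writeFileSync(file, src.split(OLD_SPEC).join(NEW_SPEC));
      }
    }

After that, the only thing left for Codex is whatever the typechecker still complains about.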
I've been a hardcore claude-4-sonnet + Cursor fan for a long time, but in the last 2 months my usage went through the roof. I started with the basic Cursor subscription, then upgraded to Pro, until I hit usage limits again. Then I started using my own Claude API key, but I was still paying ~$70 per 5 days, which is not that sustainable for me.

But since grok-code-fast-1 landed, I've been using it daily with Cursor and it's fantastic: fast and cheap (free so far). I've also been using GPT-5 lately through the official Codex VS Code extension, and it blows my mind. Last night I used gpt-5-medium to help me heavily refactor a react-native app and improve its structure and overall performance, something that would've taken me at least 2 days. Now I'm testing out gpt-5-medium-codex: I asked it to restructure the entire app routing, and it makes a lot of tool calls, understands, executes commands, and is very organized.

Overall my stack from now on is Cursor + grok-code-fast-1 for daily use, and Codex/GPT-5 when I need the brainz. Worth noting that I abused gpt-5-medium all day long yesterday and never hit any kind of limit (I just used my ChatGPT Plus account), for which I thank the OpenAI team.
What exactly did your workflow look like for the gpt-5-medium refactor you did?
I don't have a test like that on hand, so I'm really curious what exactly you prompted the model with, what it suggested, and how much your knowledge as a SWE enabled that workflow.
I'd like a more concrete understanding of whether the mind-blowing results are attainable for any average SWE, an average Joe who tinkers, or only a top-decile engineer.
I also hit Cursor usage limits for the first time in a year. Hit limits on Claude, GPT, and then it started using Grok :)
I chose to turn on Cursor's pay-per-usage within the Pro plan (so I paid $25: $20 + $5 of usage, instead of upgrading to $60/mo) in order to keep using Claude, because it's faster than Grok.
I've landed on more or less the same. grok-code-fast-1 has been working well for most coding tasks. I use it in opencode (I guess it's free for some amount of time? Because I haven't added any grok keys ¯\_(ツ)_/¯)
Codex in the IDE just works; very impressed with the quality. If you tried it a while back and didn't like it, try it again via the VS Code extension: generous usage is included with Plus.
From my observation over the past 2 weeks, Claude Code has been getting dramatically worse, with super low usage quotas, while OpenAI Codex has been getting great and has a very generous usage quota in comparison.
For people that have not tried it in say ~1 month, give Codex CLI a try.
All that matters to end users is to never be trapped. Cross-shop these products constantly and go for the best price-to-performance ratio. We've seen all the companies trade blows over the last year, but none of them is offering something novel within the current space. There is no reason to "stick to one service", but the services will try very hard to keep you stuck for that SaaS revenue.
Does it still go "your project is using git, let me just YOLO stuff" on first startup?
It's been interesting reading this thread and seeing that others have also switched to using Codex over Claude Code. I kept running into a huge issue with Claude Code creating mock implementations and general fakery when it was overwhelmed. I spent so much time tuning my input prompt just to keep it from making things worse that I eventually switched.
Granted, it's not an apples-to-apples comparison since Codex has the advantage of working in a fully scaffolded codebase where it only has to paint by numbers, but my overall experience has been significantly better since switching.
Other systems don't have a bespoke "planning" mode, and there you need to "tune your input prompt" because they just rush into implementation by guessing what you wanted.
Question, how do I get the equivalent of Claude's "normal mode" in Codex CLI?
It is super annoying that it either vibe-codes and just edits and uses tools, or it has a plan mode, but there's no in-between where it asks me whether it's fine to do A or B.
I don't understand why it lacks such a capability; why in the world would I want to choose between having to copy-paste the edits or auto-accepting them by default...
Usually I give it a prompt that includes telling it to formulate a plan and not do any coding until I approve. I will usually do several loops of that before I give it the instruction to go forward with the plan. I like to copy and paste the plan elsewhere, because at times these LLMs can "forget" it. I usually do testing at each major milestone (either handed off to me, or via builds/unit tests).
Yeah, no way I'm copy-pasting or letting it just vibe.
I want it to help me come up with a plan, then execute while I check and edit every single change, but compared with the UX offered by Claude, Codex is simply atrocious; I regret spending 23 euros on this.
I see the Visual Studio Code extension does offer something like this, but the UX/UI is terrible. Doesn't OAI have people testing these things?
The code is unreadable in that small window[1], it doesn't show the lines above/below, and it doesn't have IDE tooling (you can't inspect types, e.g.).
[1] https://i.imgur.com/mfPpMlI.png
This is just not good; that's the kind of AI that slows me down and doesn't help at all.
Very impressive. I've been working on a shared background presence animation, and have been testing out Claude and Codex. (By shared presence, I mean imagine a page's background changing based on where everyone's cursor is)
Both were struggling yesterday, with Claude being a bit ahead. Their biggest problems came with being "creative" (their solutions were pretty "stock"), and they had trouble making the simulation.
Tried the same problem on Codex today. The design it came up with still felt a bit lackluster, but it did _a lot_ better on the simulation.
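For the curious, the core of the ask is small. A rough browser-side sketch of "background follows everyone's cursors" (the WebSocket URL and message shape are made up, and a real version would throttle and animate):

    // Hypothetical sketch: share cursor positions and tint the page background
    // toward the average position of everyone currently connected.
    type Cursor = { id: string; x: number; y: number }; // x/y normalized to 0..1

    const cursors = new Map<string, Cursor>();
    const ws = new WebSocket("wss://example.invalid/presence"); // placeholder URL

    // Broadcast my own cursor once the socket is open (throttling omitted for brevity).
    ws.addEventListener("open", () => {
      document.addEventListener("mousemove", (e) => {
        ws.send(JSON.stringify({
          id: "me",
          x: e.clientX / window.innerWidth,
          y: e.clientY / window.innerHeight,
        }));
      });
    });

    // Receive everyone's cursors and re-render the background around their centroid.
    ws.addEventListener("message", (event) => {
      const c: Cursor = JSON.parse(event.data);
      cursors.set(c.id, c);
      const all = [...cursors.values()];
      const cx = all.reduce((sum, p) => sum + p.x, 0) / all.length;
      const cy = all.reduce((sum, p) => sum + p.y, 0) / all.length;
      document.body.style.background =
        `radial-gradient(circle at ${cx * 100}% ${cy * 100}%, #334, #112)`;
    });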
> Their biggest problems came with being "creative" (their solutions were pretty "stock")
LLM-designed UIs will always look generic/stock if you don't give the model additional prompting, because of how LLMs work: they've memorized certain design patterns, and if you don't specify what you want they will always default to a certain look.
Try adding additional UI instructions to your prompts. Tell it what color scheme you want, what design choices you prefer, etc. Or tell it to scan your existing app’s design and try to match it. Often the results will be much better this way.
Only a 1.7% upgrade on SWE-bench compared to GPT-5, but 33.9% vs. 51.3% on their internal code-refactoring benchmark. This seems like an Opus 4.1-like upgrade, which is nice to see, and means they're serious about Codex.
SWE-bench is a great eval, but it's very narrow. Two models can have the same SWE-bench scores but very different user experiences.
Here's a nice thread on X about the things that SWE-bench doesn't measure:
https://x.com/brhydon/status/1953648884309536958
GPT-5 may underwhelm with the same sparse prompt, as it seems to do exactly what's asked and not more.
You can still "fully vibe" with GPT-5, but the pattern works better in two steps:
1. Plan (iterate on high-level spec/PRD, split into actions)
2. Build (work through plans)
Splitting the context here is important, as any LLM will perform worse as the context gets more polluted.
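To make the split concrete (this is just the principle, not how Codex manages context internally): plan in one conversation, then start a fresh one that only sees the finished plan. A sketch with the OpenAI SDK, assuming a "gpt-5" model is available over the API:

    // Sketch of the plan -> build split with separate contexts.
    // Assumes the official `openai` npm package and a "gpt-5" model reachable via the API.
    import OpenAI from "openai";

    const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

    async function planThenBuild(task: string) {
      // Step 1: iterate on a high-level plan in its own conversation.
      const plan = await client.chat.completions.create({
        model: "gpt-5",
        messages: [
          { role: "system", content: "Produce a short, numbered implementation plan. No code." },
          { role: "user", content: task },
        ],
      });
      const planText = plan.choices[0].message.content ?? "";

      // Step 2: a fresh context that only sees the finished plan,
      // not the back-and-forth that produced it.
      const build = await client.chat.completions.create({
        model: "gpt-5",
        messages: [
          { role: "system", content: "Implement the plan step by step." },
          { role: "user", content: `Plan:\n${planText}\n\nImplement step 1.` },
        ],
      });
      return build.choices[0].message.content;
    }

    planThenBuild("Extract the auth code into its own package").then(console.log);

The point is that the build call never sees the noisy planning back-and-forth, only its distilled output.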
Ditched my Claude Code Max sub for the ChatGPT Pro $200 plan. So much faster, and I haven't hit any limits yet.
The $200 Pro plan feels like good value, personally.
What?
My essentials for any coding agent are proper whitelists for allowed commands (you can run uv run <anything>, but rm requires approval every time) and customisable slash commands.
I can live without hooks and subagents.
This is a nearly impossible problem to solve.
uv run rm *
Sandboxing and limiting its blast radius is the only reliable solution. For more inspiration: https://gtfobins.github.io/
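To make the `uv run rm *` point concrete: any allowlist that matches on a command prefix can be laundered through an allowed runner. A toy check (the allowlist and matching rule are invented for illustration, not Codex's or Claude's actual config):

    // Toy prefix-based allowlist, roughly how naive "allowed commands" checks work.
    // The list and the matching rule are invented for illustration.
    const ALLOWED_PREFIXES = ["uv run", "cargo test", "npm test"];

    function isAllowed(command: string): boolean {
      return ALLOWED_PREFIXES.some((prefix) => command.startsWith(prefix));
    }

    console.log(isAllowed("rm -rf build"));        // false: rm still needs approval
    console.log(isAllowed("uv run pytest"));       // true:  the intended use
    console.log(isAllowed("uv run rm -rf build")); // true:  the bypass, since uv will
                                                   // happily exec an arbitrary program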
1) ask it to produce a plan, 2) ask it to implement the plan
That's the way to work with Claude.