Interesting, the new model's prompt is ~half the size (10KB vs. 23KB) of the previous prompt[0][1].
SWE-bench performance is similar to normal gpt-5, so it seems the main delta with `gpt-5-codex` is on code refactors (per their internal refactor benchmark: 33.9% -> 51.3%).
As someone who recently used Codex CLI (`gpt-5-high`) to do a relatively large refactor (multiple internal libs to dedicated packages), I kept running into bugs introduced when the model would delete a file and then rewrite it from scratch (missing crucial details). My approach would have been to just copy the file over and then make package-specific changes, so maybe better tool calling is at play here.
Additionally, they claim the new model is more steerable (both with AGENTS.md and generally). In my experience, Codex CLI w/ gpt-5 is already a lot more steerable than Claude Code, but any improvements are welcome!

[0] https://github.com/openai/codex/blob/main/codex-rs/core/gpt_...
[1] https://github.com/openai/codex/blob/main/codex-rs/core/prom...
Interestingly, "more steerable" can sometimes be a bad thing, as it will tend to follow your prompt to the letter even if that's against your interests. It requires better prompting and generally knowing what you're doing - might be worse for vibe-coders and better for experienced SWEs.
Not suddenly, it's been better since GPT-5 launched.
Prompting is different, but in a good way.
With Claude Code, you can use less prompting, and Claude will get token-happy and expand on your request. Great for greenfield/vibing, bad for iterating on existing projects.
With Codex CLI, GPT-5 seems to handle instructions much more precisely. It won't just go off on its own and do a bunch of work; it will do what you ask.
I've found that being more specific up-front gets better results with GPT-5, whereas with Claude, being more specific doesn't necessarily rein in the eagerness of its output.
As with all LLMs, mileage varies by language and codebase, so to clarify: my experience is primarily with TypeScript and Rust codebases.
Small suggestion on refactors into packages: move the files manually, then just tell Codex "they used to be in different locations, fix it up so it builds".
It seems the concept of file moving isn't something Codex (and the other CLIs) handles well yet. (Same goes for removing; I've ~never seen it successfully track moves and removes in the git commit if I ask for one.)
Does refactoring mean moving things around for people? Why not use your IDE for this? It already handles fixing imports (or use find-replace), and it's faster and deterministic.
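If you'd rather script it than click through an IDE, the deterministic version is tiny. A rough Node sketch (every name here, the directories and the `@acme/...` specifiers, is made up for illustration):

    // move-lib.ts: hypothetical one-off script that moves an internal lib into a
    // dedicated package and rewrites its import specifier everywhere.
    import * as fs from "node:fs";
    import * as path from "node:path";

    const OLD_DIR = "libs/auth";            // where the code lives today (made up)
    const NEW_DIR = "packages/auth/src";    // where it should end up (made up)
    const OLD_SPEC = "@acme/internal/auth"; // old import specifier (made up)
    const NEW_SPEC = "@acme/auth";          // new import specifier (made up)

    // 1. Move the directory; a rename is deterministic, and git detects it as a rename on commit.
    fs.mkdirSync(path.dirname(NEW_DIR), { recursive: true });
    fs.renameSync(OLD_DIR, NEW_DIR);

    // 2. Rewrite the import specifier in every TypeScript source file.
    function walk(dir: string): string[] {
      return fs.readdirSync(dir, { withFileTypes: true }).flatMap((entry) => {
        const full = path.join(dir, entry.name);
        if (entry.isDirectory()) {
          return entry.name === "node_modules" || entry.name === ".git" ? [] : walk(full);
        }
        return /\.(ts|tsx)$/.test(entry.name) ? [full] : [];
      });
    }

    for (const file of walk(".")) {
      const src = fs.readFileSync(file, "utf8");
      if (src.includes(OLD_SPEC)) {
        fs.writeFileSync(file, src.split(OLD_SPEC).join(NEW_SPEC));
      }
    }

After that, the only thing left for Codex is whatever the typechecker still complains about.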
I've been a hardcore claude-4-sonnet + Cursor fan for a long time, but in the last 2 months my usage went through the roof. I started with the basic Cursor subscription, then upgraded to Pro, until I hit usage limits again. Then I started using my own Claude API key, but I was still paying ~$70 per 5 days, which is not that sustainable for me.

But since grok-code-fast-1 landed, I've been using it daily with Cursor and it's fantastic: fast and cheap (free so far). I've also been using GPT-5 lately through the official Codex VS Code extension, and it blows my mind. Last night I used gpt-5-medium to help me heavily refactor a react-native app and improve its structure and overall performance, something that would've taken me at least 2 days. Now I'm testing out gpt-5-medium-codex: I asked it to restructure the entire app routing, and it makes a lot of tool calls, understands, executes commands, and is very organized.

Overall my stack from now on is Cursor + grok-code-fast-1 for daily use, and Codex/GPT-5 when I need the brainz. Worth noting that I abused gpt-5-medium all day long yesterday and never hit any kind of limit (I just used my ChatGPT Plus account), for which I thank the OpenAI team.
What exactly did your workflow look like for the gpt-5-medium refactor you did?
I don't have a test like that on hand, so I'm really curious what exactly you prompted the model with, what it suggested, and how much your knowledge as a SWE enabled that workflow.
I'd like a more concrete understanding of whether the mind-blowing results are attainable for any average SWE, an average Joe who tinkers, or only a top-decile engineer.
I also hit Cursor usage limits for the first time in a year. Hit limits on Claude, GPT, and then it started using Grok :)
I chose to turn on Cursor's pay-per-usage within the Pro plan (so I paid $25: $20 + $5 of usage, instead of upgrading to $60/mo) in order to keep using Claude, because it's faster than Grok.
I've landed on more or less the same. grok-code-fast-1 has been working well for most coding tasks. I use it in opencode (I guess it's free for some amount of time? Because I haven't added any grok keys ¯\_(ツ)_/¯)
Codex in the IDE just works; very impressed with the quality. If you tried it a while back and didn't like it, try it again via the VS Code extension: generous usage is included with Plus.
From my observation over the past 2 weeks, Claude Code has been getting dramatically worse, with super low usage quotas, while OpenAI Codex has been getting great and has a very generous usage quota in comparison.
For people that have not tried it in say ~1 month, give Codex CLI a try.
All that matters to end users is to never be trapped. Cross-shop these products constantly and go for the best price-to-performance ratio. We've seen all the companies trade blows over the last year, but none of them is offering something novel within the current space. There is no reason to "stick to one service", but the services will try very hard to keep you stuck for that SaaS revenue.
Does it still go "your project is using git, let me just YOLO stuff" on first startup?
It's been interesting reading this thread and seeing that others have also switched to using Codex over Claude Code. I kept running into a huge issue with Claude Code creating mock implementations and general fakery when it was overwhelmed. I spent so much time tuning my input prompt just to keep it from making things worse that I eventually switched.
Granted, it's not an apples-to-apples comparison since Codex has the advantage of working in a fully scaffolded codebase where it only has to paint by numbers, but my overall experience has been significantly better since switching.
Other systems don't have a bespoke "planning" mode, and there you need to "tune your input prompt" because they just rush into implementation by guessing what you wanted.
Question, how do I get the equivalent of Claude's "normal mode" in Codex CLI?
It is super annoying that it either vibe-codes and just edits and uses tools, or it has a plan mode, but there's no in-between where it asks me whether it's fine to do A or B.
I don't understand why it lacks such a capability; why in the world would I want to choose between having to copy-paste the edits or auto-accepting them by default...
Usually I give it a prompt that includes telling it to formulate a plan and not do any coding until I approve. I will usually do several loops of that before I give it the instruction to go forward with the plan. I like to copy and paste the plan elsewhere, because at times these LLMs can "forget" it. I usually do testing at each major milestone (either handed off to me, or via builds/unit tests).
Yeah, no way I'm copy-pasting or letting it just vibe.
I want it to help me come up with a plan, then execute while I check and edit every single change, but compared with the UX offered by Claude, Codex is simply atrocious; I regret spending 23 euros on this.
I see the Visual Studio Code extension does offer something like this, but the UX/UI is terrible. Doesn't OAI have people testing these things?
The code is unreadable in that small window[1], it doesn't show the lines above/below, and it doesn't have IDE tooling (you can't inspect types, e.g.).
[1] https://i.imgur.com/mfPpMlI.png
This is just not good; that's the kind of AI that slows me down and doesn't help at all.
Very impressive. I've been working on a shared background presence animation, and have been testing out Claude and Codex. (By shared presence, I mean imagine a page's background changing based on where everyone's cursor is)
Both were struggling yesterday, with Claude being a bit ahead. Their biggest problems came with being "creative" (their solutions were pretty "stock"), and they had trouble making the simulation.
Tried the same problem on Codex today. The design it came up with still felt a bit lackluster, but it did _a lot_ better on the simulation.
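For the curious, the core of the ask is small. A rough browser-side sketch of "background follows everyone's cursors" (the WebSocket URL and message shape are made up, and a real version would throttle and animate):

    // Hypothetical sketch: share cursor positions and tint the page background
    // toward the average position of everyone currently connected.
    type Cursor = { id: string; x: number; y: number }; // x/y normalized to 0..1

    const cursors = new Map<string, Cursor>();
    const ws = new WebSocket("wss://example.invalid/presence"); // placeholder URL

    // Broadcast my own cursor once the socket is open (throttling omitted for brevity).
    ws.addEventListener("open", () => {
      document.addEventListener("mousemove", (e) => {
        ws.send(JSON.stringify({
          id: "me",
          x: e.clientX / window.innerWidth,
          y: e.clientY / window.innerHeight,
        }));
      });
    });

    // Receive everyone's cursors and re-render the background around their centroid.
    ws.addEventListener("message", (event) => {
      const c: Cursor = JSON.parse(event.data);
      cursors.set(c.id, c);
      const all = [...cursors.values()];
      const cx = all.reduce((sum, p) => sum + p.x, 0) / all.length;
      const cy = all.reduce((sum, p) => sum + p.y, 0) / all.length;
      document.body.style.background =
        `radial-gradient(circle at ${cx * 100}% ${cy * 100}%, #334, #112)`;
    });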
> Their biggest problems came with being "creative" (their solutions were pretty "stock")
LLM-designed UIs will always look generic/stock if you don't give the model additional prompting, because of how LLMs work: they've memorized certain design patterns, and if you don't specify what you want they will always default to a certain look.
Try adding additional UI instructions to your prompts. Tell it what color scheme you want, what design choices you prefer, etc. Or tell it to scan your existing app’s design and try to match it. Often the results will be much better this way.
Only a 1.7% upgrade on SWE-bench compared to GPT-5, but 33.9% vs. 51.3% on their internal code-refactoring benchmark. This seems like an Opus 4.1-like upgrade, which is nice to see, and means they're serious about Codex.
SWE-bench is a great eval, but it's very narrow. Two models can have the same SWE-bench scores but very different user experiences.
Here's a nice thread on X about the things that SWE-bench doesn't measure:
https://x.com/brhydon/status/1953648884309536958
GPT-5 may underwhelm with the same sparse prompt, as it seems to do exactly what's asked and not more.
You can still "fully vibe" with GPT-5, but the pattern works better in two steps:
1. Plan (iterate on high-level spec/PRD, split into actions)
2. Build (work through plans)
Splitting the context here is important, as any LLM will perform worse as the context gets more polluted.
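To make the split concrete (this is just the principle, not how Codex manages context internally): plan in one conversation, then start a fresh one that only sees the finished plan. A sketch with the OpenAI SDK, assuming a "gpt-5" model is available over the API:

    // Sketch of the plan -> build split with separate contexts.
    // Assumes the official `openai` npm package and a "gpt-5" model reachable via the API.
    import OpenAI from "openai";

    const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

    async function planThenBuild(task: string) {
      // Step 1: iterate on a high-level plan in its own conversation.
      const plan = await client.chat.completions.create({
        model: "gpt-5",
        messages: [
          { role: "system", content: "Produce a short, numbered implementation plan. No code." },
          { role: "user", content: task },
        ],
      });
      const planText = plan.choices[0].message.content ?? "";

      // Step 2: a fresh context that only sees the finished plan,
      // not the back-and-forth that produced it.
      const build = await client.chat.completions.create({
        model: "gpt-5",
        messages: [
          { role: "system", content: "Implement the plan step by step." },
          { role: "user", content: `Plan:\n${planText}\n\nImplement step 1.` },
        ],
      });
      return build.choices[0].message.content;
    }

    planThenBuild("Extract the auth code into its own package").then(console.log);

The point is that the build call never sees the noisy planning back-and-forth, only its distilled output.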
Ditched my Claude Code Max sub for the ChatGPT Pro $200 plan. So much faster, and I haven't hit any limits yet.
The $200 Pro plan feels like good value, personally.
What?
My essentials for any coding agent are proper whitelists for allowed commands (you can run uv run <anything>, but rm requires approval every time) and customisable slash commands.
I can live without hooks and subagents.
This is a nearly impossible problem to solve.
uv run rm *
Sandboxing and limiting its blast radius is the only reliable solution. For more inspiration: https://gtfobins.github.io/
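To make the `uv run rm *` point concrete: any allowlist that matches on a command prefix can be laundered through an allowed runner. A toy check (the allowlist and matching rule are invented for illustration, not Codex's or Claude's actual config):

    // Toy prefix-based allowlist, roughly how naive "allowed commands" checks work.
    // The list and the matching rule are invented for illustration.
    const ALLOWED_PREFIXES = ["uv run", "cargo test", "npm test"];

    function isAllowed(command: string): boolean {
      return ALLOWED_PREFIXES.some((prefix) => command.startsWith(prefix));
    }

    console.log(isAllowed("rm -rf build"));        // false: rm still needs approval
    console.log(isAllowed("uv run pytest"));       // true:  the intended use
    console.log(isAllowed("uv run rm -rf build")); // true:  the bypass, since uv will
                                                   // happily exec an arbitrary program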
1) ask it to produce a plan, 2) ask it to implement the plan
That's the way to work with Claude.