We were heavy users of Claude Code ($70K+ spend per year) and have almost completely switched to codex CLI. I'm doing massive lifts with it on software that would never before have been feasible for me personally, or any team I've ever run. I'll use Claude Code maybe once every two weeks as a second set of eyes to inspect code and document a bug, with mixed success. But my experience has been that initially Claude Code was amazing and a "just take my frikkin money" product. Then Codex overtook CC and is much better at longer runs on hard problems. I've seen Claude Code literally just give up on a hard problem and tell me to buy something off the shelf. Whereas Codex's ability to profoundly increase the capabilities of a software org is a secret that's slowly getting out.
I don't have any relationship with any AI company, and honestly I was rooting for Anthropic, but Codex CLI is just way way better.
Also Codex CLI is cheaper than Claude Code.
I think Anthropic are going to have to somehow leapfrog OpenAI to regain the position they were in around June of this year. But right now they're being handed their hat.
Feels like with every announcement there’s the same comment: “this LLM tool I’m using now is the real deal, the thing I was using previously and spending stupid amounts of money on looked good but failed at XYZ, this new thing is where it’s at”. Rinse and repeat.
Which means it wasn’t true any of the previous times, so why would it be true this time? It feels like an endless loop of the “friendship ended” meme with AI companies.
It’s much more likely commenters are still in the honeymoon hype phase and (again) haven’t found the problems because they’re hyper focused on what the new thing is good at that the previous one wasn’t, ignoring the other flaws. I see that a lot with human relationships as well, where people latch on to new partners because they obviously don’t have the big problem that was a strain on the previous relationship. But eventually something else arises. Rinse and repeat.
People just aren't used to how LLMs and their tools are developed
Depending on the time of the year you can expect fresh updates from any given company on how their new models and tools perform and they'll generally blow the competition out of the water.
The trick is to either realize that your current tools will just become magically better in a few months OR lean in and switch companies as their tools and models update.
GPT-5 is not the final deal, but it's incredibly good as is at coding.
Anecdotal, but it's something else entirely in terms of capabilities. Ignore it at your own peril; I think it will profoundly change software development.
In my experience this mostly happened with Anthropic models, and not because of some honeymoon period: after their models are introduced, the model quality and the limits start to decline.
Many people are complaining about this on HN and Reddit.
I do not have any proof, but there is a pattern; I suppose Anthropic first attracts customers, then starts to optimize costs/margins.
I find Codex CLI to be very good too, but it’s missing tons of features that I use in Claude Code daily that keep me from switching full time.
- Good bash command permission system
- Rollbacks coupled with conversation and code
- Easy switching between approval modes (Claude had a keybind that makes this easy)
- Ability to send messages while it’s working (Codex just queues them up for after it’s done, Claude injects them into the current task)
- Codex is very frustrating when I have to keep allowing it to run the same commands over and over; in Claude this works well once I approve it to run a command for the session
- Agents (these are very useful for controlling context)
- A real plan mode (crucial)
- Skills (these are basically just lazy loaded context and are amazing)
- The sandboxing in codex is so confusing, commands fail all the time because they try to log to some system directory or use internet access which is blocked by default and hard to figure out
- Codex prefers python snippets to bash commands which is very hard to permission and audit
When Codex gets to feature parity, I’ll seriously look at switching, but until then it’s just a really good model wrapped in an okay harness
I don't think anyone can reasonably argue against Claude Code being the most full-featured and pleasant to use of the CLI coding agent tools. Maybe some people like the Codex user experience for idiosyncratic reasons, but it (like Gemini CLI) still feels to me rather thrown together - a Claude Clone with a lot of rough edges.
But these CLI tools are still fairly thin wrappers around an LLM. Remember: they're "just an LLM in a while loop with access to tool calls." (I exaggerate, and I love Claude Code's more advanced features like "skills" as much as anyone, but at the core, that's what they are.) The real issue at stake is which LLM behind the agent is better: is GPT-5 or Sonnet 4.5 better at coding? On that I think opinion is split.
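To make that "LLM in a while loop" framing concrete, here is a minimal sketch of the kind of loop these CLIs run. It's illustrative only: it uses the openai Python SDK's tool calling with a placeholder "gpt-5" model name and a single run_shell tool, while the real harnesses layer permissions, sandboxing, and diff rendering on top.

```python
# Minimal sketch of "an LLM in a while loop with access to tool calls".
# Assumptions: openai Python SDK installed, OPENAI_API_KEY set, "gpt-5" as a placeholder model name.
import json
import subprocess
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "run_shell",
        "description": "Run a shell command in the repo and return its output.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}]

messages = [{"role": "user", "content": "Fix the failing test in this repo."}]

while True:
    resp = client.chat.completions.create(model="gpt-5", messages=messages, tools=tools)
    msg = resp.choices[0].message
    messages.append(msg)
    if not msg.tool_calls:            # no tool calls left: the agent considers itself done
        print(msg.content)
        break
    for call in msg.tool_calls:       # run each requested command and feed the output back
        cmd = json.loads(call.function.arguments)["command"]
        out = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": (out.stdout + out.stderr)[-4000:],  # truncate to keep context manageable
        })
```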
Incidentally, you can run Claude Code with GPT-5 if you want a fair(er) comparison. You need a proxy like LiteLLM, and you will have to use the OpenAI API and pay per-token, but it's not hard to do and quite interesting. I haven't used it enough to make a good comparison, however.
Yeah I think the argument is the tooling vs the agent. Maybe the OpenAI agent is performing better now, but the tooling is significantly better from Anthropic.
The anthropic (ClaudeCode) tooling is best-in-class to me. You listed many features that I have become so reliant on now, that I consider them the Ante that other competitors need to even be considered.
I have been very impressed with the Anthropic agent for code generation and review. I have found the OpenAI agent to be significantly lacking by comparison. But to be fair, the last time I used OpenAI's agent for code was about a month ago, so maybe it has improved recently (not at all unreasonable in this space). But at least a month ago, when using them side-by-side, the codex CLI was VERY basic compared to the wealth of features and UI in the ClaudeCode CLI. The agents for Claude were also so much better than OpenAI's that it wasn't even close. OpenAI has always delivered me improper code (non-working or invalid) at a very high rate, whereas Claude generally delivers valid code; the debate is just whether it is the desired way to build something.
I am not sure copying your competitors feature-by-feature is always a good strategy. It can make the onboarding of your competitor's users easier, but lead to a worse product overall.
This is especially the case in a fast moving field such as this. You would not want to get stuck in the same local minimum as your competitor.
I would rather we have competing products that try different things to arrive at a better solution overall.
I'm using opencode which I think is now very close to covering all the functionality of claude code. You can use GPT5 Codex with it along with most other models.
To fix having to approve commands over and over, use Windows WSL. Codex does not play nice with permissions/approvals on Windows; WSL solves that completely.
I totally agree. I remember the June magic as well - almost overnight my abilities and throughput were profoundly increased, I had many weeks of late nights in awe and wonder trying things that were beyond my ability to implement technically but within the bounds of my conceptual understanding.
Initially, I found Codex CLI with GPT-5 to be a substitute for Claude Code - now GPT-5 Codex materially surpasses it in my line of work, with a huge asterisk. I work in a niche industry, and Codex has generally poor domain understanding of many of the critical attributes and concepts. Claude happens to have better background knowledge for my tasks, so I've found that Sonnet 4.5 with Claude Code generally does a better job at scaffolding any given new feature. Then, I call in Codex to implement actual functionality since Codex does not have the "You're absolutely right" and mocked/placeholder implementation issues of CC, and just generally writes clean, maintainable, well-planned code. It's the first time I've ever really felt the whole "it's as good as a senior engineer" hype - I think, in most cases, GPT5-Codex finally is as good as a senior engineer for my specific use case.
I think Codex is a generally better product with better pricing, typically 40-50% cheaper for about the same level of daily usage for me compared to CC. I agree that it will take a genuinely novel and material advancement to dethrone Codex now. I think the next frontier for coding agents is speed. I would use CC over Codex if it was 2x or 3x as fast, even at the same quality level. Otherwise, Codex will remain my workhorse.
>I think, in most cases, GPT5-Codex finally is as good as a senior engineer for my specific use case.
This is beyond bananas to me given that I regularly see Codex high and GPT-5-high both fail to create basic React code slightly off the normal distribution.
I'm not saying this is a paid endorsement, but the internet is dead and I wonder what OpenAI would pay, if they could, to get such a glowing review as the top comment on HN.
For what it's worth, I'm not affiliated with OpenAI (you can verify by my comment history [1] and account age) and I agree with the top comment. I do Elixir consulting primarily and nothing beats OpenAI's model at the moment for Elixir. Previously, their o3 models were quite decent. But GPT-5 is really damn good.
Claude code will unnecessarily try to complicate a problem solution.
I'm not saying this was a paid comment, but if we're going to speculate, we could just as easily ask what Anthropic would pay, if they could, to drown out a strongly pro-OpenAI take sitting at the top of their own promotional HN thread.
That said, you're right that the broader internet (Reddit especially) is heavily astroturfed. It's not unusual to see "What's the best X?" threads seeded by marketers, followed by a horde of suspiciously aligned comments.
But without actual evidence, these kinds of meta comments (yours and mine) are just cynical noise.
I've heard this opinion a lot recently. Codex is getting better and Claude is getting worse, so it had to happen sooner or later. Well, it's competition, so I'm waiting for Claude to catch up. The web Claude Code is good, but they really need to fix their quota; it's unusable. I would choose a worse model (maybe at 90%) that has a better, usable quota. Not to mention GPT-5 and GPT-5-Codex seem to have caught up or even be better now.
Are you really going to call someone a shill? I’d argue that you’re why the internet is dying - a million options and you had to choose the most offensive?
The only way to tell human from AI now is disagreeableness, it’s the one thing the GPTs refuse to do. I can’t stand their cloying sycophancy but at least it means that serial complainers will gain some trust, at least for as long as Americans are leading the hunt and deciding to baby us.
I completely agree with this. The amount of unprompted “I used to love Claude Code but now…” content that follows the exact same pattern feels really off. All of these people post without any prompts for comparison, and OP even refused to share specifics so we have to take his claim as ‘trust me bro’
I agree with this and actually Claude Code agrees with it too. I've had Codex CLI (gpt-5-codex high) and Claude Code 4.5 Sonnet (and sometimes Opus 4.1) do the same lengthier task with the same prompt in cloned folders about 10x now, and then I ask them to review the work in the other folder and determine who did the best job.
100% of the time Codex has done a far better job according to both Codex and Claude Code when reviewing. Meeting all the requirements where Claude would leave things out, do them lazily or badly and lose track overall.
Codex high just feels much smarter and more capable than Claude currently and even though it's quite a bit slower, it's work that I don't have to go over again and again to get it to the standards I want.
I share your observations. It's strange to see Anthropic losing so much ground so fast - they seemed to be the first to crack long-horizon agentic tasks via what I can only assume is an extremely exotic RL process.
Now, I will concede that for non-coding long-horizon tasks, GPT-5 is marginally worse than Sonnet 4.5 in my own scaffolds. But GPT-5 is cheaper, and Sonnet 4.5 is about 2 months newer. However, for coding in a CLI context, GPT-5-Codex is night-and-day better. I don't know how they did it.
Same, I find Codex not good to be honest. I have better success manually copy/pasting into GPT5 chat. There's something about Codex that just wants to change everything and use the weirdest tool commands.
It also often fails to escalate a command, it'll even be like, oh well I'm in a sandbox so I guess I can't do this, and will just not do it and try to find a workaround instead of escalating permission to do the command.
The last time I used them both side by side was a month ago, so unless it's significantly improved in the past month, I am genuinely surprised that someone is making the argument that Codex is competitive with ClaudeCode, let alone it somehow being superior.
ClaudeCode is used by me almost daily, and it continues to blow me away. I don't use Codex often because every time I have used it, the output is next to worthless and generally invalid. Even if it does get me what I eventually want, it will take much more prompting for me to get the functioning result. ClaudeCode on the other hand gets me good code from the initial prompt. I'm continually surprised at exactly how little prompting it requires. I have given it challenges with very vague prompts where it really exceeds my expectations.
I use both but I agree that they are generally not on par. I find Claude Code does a better job and doesn't overengineer as much. Where Codex sometimes does better is in debugging a tough bug that stumps Claude Code. Codex is also more likely to get lazy, claiming to have finished a large task when in reality it just wrote some placeholder lines. Claude has never done that. They might be on par soon, however, and I think Anthropic is playing a dangerous game with their limit enforcement on people who are on subscriptions.
Me too, but I know it's not just people shilling, or on the take, because a bunch of people I know personally have moved from Claude Code to Codex, and say it's better.
For me, though, it's not remotely close. Codex has fucked up 95% of the 50-or-so tasks I asked it to do, while Claude Code fucks up only maybe 60%.
I'm big on asking LLMs to do the first major step of something, and then coming back later, and if it looks like it kinda sucks, just Ctrl-C and git revert that container/folder. And I also explicitly set up "here are the commands you need to run to self-check your work" every time. (Which Codex somewhat weirdly sometimes ignores with the explicit (false) claim that it skipped that step because it wasn't requested... hmm.)
So, those kinds of workflow preferences might be a factor, but I haven't seen Codex ever be good yet, and I regret the time I invested trying it too early.
Codex works much better for long-running tasks that require a lot of planning and deep understanding.
Claude, especially 4.5 Sonnet, is a lot nicer to interact with, so it may be a better choice in cases where you are co-working with the agent. Its output is nicer, and it "improvises" really well even if you give it only vague prompts. That's valuable for interactive use.
But for delegating complete tasks, Codex is far better. The benchmarks indicate that, as do most practitioners I talk to (and it is indeed my own experience).
In my own work, I use Codex for complete end-to-end tasks, and Claude Sonnet for interactive sessions. They're actually quite different.
I'm on the same page here. I have seen this sentiment about Codex suddenly being good a few times now, so I booted Codex CLI thinking-high back up after a break and asked it to look for bugs. It promptly found five bugs that didn't actually exist. It was the kind of truly impressively stupid mistake that I haven't seen Claude Code make essentially ever, and made me wonder if this isn't the sort of thing that's making people downplay the power of LLMs for agentic coding.
Same here. I tried codex a few days ago for a very simple task (remove any references of X within this long text string) and it fumbled it pretty hard. Very strange.
I can not. We're all racing very hard to take full advantage of these new capabilities before they go mainstream. And to be honest, sharing problem domains that are particularly attractive would be sharing too much. Go forth and experiment. Have fun with it. You'll figure it out pretty fast. You can read my other post here about the kinds of problem spaces I'm looking at.
Yeah this has been my experience as well. The Claude Code UI is still so much better, and the permissioning policy system is much better. Though I'm working on closing that gap by writing a custom policy https://github.com/openai/codex/blob/main/codex-rs/execpolic...
Kinda sick of Codex asking for approval to run tests for each test instance
Still a toss-up for me which one I use. For deep work Codex (codex-high) is the clear winner, but when you need to knock out something small Claude Code (sonnet) is a workhorse.
Also CC tool usage is so much better! Many, many times I’ve seen Codex writing a python script to edit a file which seems to bypass the diff view so you don’t really know what’s going on.
Yeah, after correcting it several times I've gotten Claude Code to tell me it didn't have the expertise to work in one of my problem domains. It was kinda surprising but also kinda refreshing that it knew when to give up. For better or worse I haven't noticed similar things with Codex.
I've chosen problems with non-negotiable outcomes. In other words, problem domains where you either are able to clearly accomplish the very hard thing, or not, and there's no grey area. I've purposely chosen these kinds of problems to prove what AI agents are capable of, so that there is no debate in my mind. And with Codex I've accomplished the previously impossible. Unambiguously. Codex did this. Claude gave up.
It's as if there are two vendors saying they can give you incredible superpowers for an affordable price, and only one of them actually delivers the full package. The other vendor's powers only work on Tuesdays, and when you're lucky. With that situation, in an environment as competitive as things currently stand, and given the trajectory we're on, Claude is an absolute non-starter for me. Without question.
I did the opposite: I switched to Claude Code once they released the new model last week or the week before. I tried using Codex, but there were issues with the terminal and prompting (multiple characters getting deleted). I found Claude Code to have more features and fewer bugs, like editing the prompt in vim, which is really useful, and I find it better to iterate with. I also prefer its tool usage and use of the shell; sometimes Codex prefers to use Python instead of running the equivalent shell command. Maybe it's like other people say here, that Codex is better for long-running tasks. I prefer to give Claude small tasks, I'm usually satisfied with the result, and I like to work alongside the agent.
This is such an interesting perspective because I feel codex is hugely impressive but falls apart on any even remotely difficult task and is too autonomous and not eager enough.
Claude feels like a better fit for an experienced engineer. He's a positive, eager little fellow.
API pricing rates probably. If I take a look at my current usage since it came out, it'd be about $12000 CAD if paid at API rates. Ridiculously easy to rack up absurd bills via the API, and I'm mostly just using it for code review. Someone using it heavily could easily, easily get way over 70k.
My experience is that if I know what I want, CC will produce better code, given I specify it correctly. The planning mode is great for this too, as we can "brainstorm", and what I have seen help a lot is asking it questions about why it did something a certain way. Often it'll figure out on its own why that's wrong, but sometimes it requires a bit of course correction.
On the other hand, the last time I tried GPT-5 from Cursor, it was so disappointing. It kept getting confused while we were iterating on a plan, and I had to point out multiple times that it kept thinking about the problem the same way. After a while I gave up, opened a new chat, and gave it my own summary of the conversation (with the wrong parts removed), and then it worked fine. Maybe my initial prompt was vague, but it continually seemed to forget course corrections in that chat.
I mostly tend to use them to save me from typing, rather than asking them to design things. Occasionally we do a more open-ended discussion, but those have great variance. It seems to do better with such discussions online than within the coding tool (I've bounced maths/implementation ideas off of it while writing shaders on a personal project).
When you say Claude Code, what model do you refer to? CC with Opus still outperforms Codex (gpt-5-codex) for me for anything I do (Rust, computer graphics-related).
However, Anthropic severely restricted Opus use for Max plan users 10 days or so ago (12-fold, from 40h/week down to 5h/week) [1].
Sonnet is a vastly inferior model for my use cases (but still frequently writes better Rust code than Codex). So now I use Codex for planning and Sonnet for writing the code. However, I usually need about 3--5 loops with Codex reviewing, Sonnet fixing, rinse & repeat.
Before I could use one-shot Opus and review myself directly, and do one polish run following my review (also via Opus). That was possible from June--mid October but no more.
Agreed that Opus is stronger than Sonnet 4.5 and GPT-5 High. It's the bitter pill - bigger, more expensive models are just "smarter", even if it doesn't always show in synthetic benchmarks. Similar with o1-pro (now almost a year old, an eternity in this space) vs GPT-5 high. There's also GPT-5 Pro now, which comes at an API cost of $120/M output, and is also noticeably smarter, just like Opus.
They all like to push synthetic benchmarks for marketing, but to me there's zero doubt that both Anthropic and OpenAI are well aware that they're not representative of logical thinking and creativity.
On the topic of comparing OpenAI models with Anthropic models, I have a hybrid approach that seems really nice.
I set up an MCP tool that lets Claude Code call gpt-5 with high reasoning (tools with "personas" like architect, security reviewer, etc.), and I feel that it SIGNIFICANTLY amplifies the performance of Claude alone. I don't see other people using LLMs as tools in these environments, and it's making me wonder if I'm either missing something or somehow ahead of the curve.
Basically instead of "do x (with details)" I say "ask the architect tool for how you should implement X" and it gets into this back and forth that's more productive because it's forcing some "introspection" on the plan.
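For anyone wanting to try the same pattern, here's a minimal sketch of one way to wire up that kind of "architect" tool as an MCP server. It assumes the official MCP Python SDK (the FastMCP helper) and the OpenAI Responses API; the model name, persona prompt, and registration command are illustrative placeholders, not a description of the exact setup above.

```python
# architect_server.py - a hypothetical MCP server exposing a GPT-5 "architect" persona.
from mcp.server.fastmcp import FastMCP
from openai import OpenAI

mcp = FastMCP("architect")
client = OpenAI()

PERSONA = (
    "You are a senior software architect. Given a question and codebase context, "
    "return a concrete implementation plan: files to touch, interfaces, edge cases, "
    "and risks. Push back on bad ideas."
)

@mcp.tool()
def ask_architect(question: str, context: str = "") -> str:
    """Ask a high-reasoning GPT-5 'architect' persona how to implement something."""
    resp = client.responses.create(
        model="gpt-5",                    # placeholder model name
        reasoning={"effort": "high"},
        input=f"{PERSONA}\n\nContext:\n{context}\n\nQuestion:\n{question}",
    )
    return resp.output_text

if __name__ == "__main__":
    # Serve over stdio so a coding agent can register it, e.g. (assumed syntax):
    #   claude mcp add architect -- python architect_server.py
    mcp.run()
```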
Sourcegraph Amp (https://sourcegraph.com/amp) has had this exact feature built in for quite a while: "ask the oracle" triggered an O1 Pro sub-agent (now, I believe, GPT-5 High), and searching can be delegated to cheaper, faster, longer-context sub-agents based on Gemini 2.5 Flash.
> We were heavy users of Claude Code ($70K+ spend per year)
Claude Code has only been generally available since May last year (a year and a half ago)... I'm surprised by the process that you are implying; within a year and a half, you both spent $70k on Claude Code and knew enough about it and its competition to switch away from it? I don't think I'd be able to do due diligence even if LLM evaluation was my full-time job. Let alone the fact that the capabilities of each provider are changing dramatically every few weeks.
Claude Code is still good but I don’t TRUST it. With Claude Code and Sonnet I’m expecting failure. I can get things done but there’s an administrative overhead of futzing around with markdown files, defensive commit hooks and unit tests to keep it on rails while managing the context panic. Codex CLI with gpt-5-codex high reasoning is next gen. I’m sure Sonnet 5 will match it soon. At that point I think a lot of the workflows people use in Claude Code will be obsolete and the sycophancy will disappear.
In agreement. Large caveats that can explain differing opinions (that I've experienced) are:
* Is really only magic on Linux or WSL. Mediocre on Windows
* Is quite mediocre at UI code but exceptional at backend, engineering, ops, etc. (I use Claude to spruce up everything user facing -- Codex _can_ mirror designs already in place fairly well).
* Exceptional at certain languages, OK at others.
* GPT-5 and GPT-5-Codex are not the same. Both are models used by the Codex CLI and the GPT-5-Codex model is recent and fantastically good.
* Codex CLI is not "conversational" in the way that Claude is. You kind of interact with it differently.
I often wonder about the impact of different prompting styles. I think the WOW moment for me is that I am no longer returning to code to find tangled messes, duplicate silo'd versions of the same solution (in a different project in the same codebase), or strangely novice style coding and error handling.
As a developer for 20yrs+, using Codex running the GPT-5-Codex model has felt like working with a peer or near-peer for the first time ever. I've been able to move beyond smaller efforts and also make quite a lot of progress that didn't have to be undone/redone. I've used it for a solid month making phenomenal progress and able to offload as-if I had another developer.
Honestly, my biggest concern is that OpenAI is teasing this capable model and then pulls the rug in a month with an "update".
As for the topic at hand, I think Claude Code has without a doubt the best "harness" and interface. It's faster, painless, and has a very clean and readable way of laying out findings when troubleshooting. If there were a cheap and usable version of Opus... perhaps that would keep Claude Code on the cutting edge.
> I've seen Claude Code literally just give up on a hard problem and tell me to buy something off the shelf
I've been seeing more of this lately despite initial excellent results. Not sure what's going on, but the value is certainly dropping for me. I'll have to check out codex. CLI integration is critical for me at this point. For me it is the only thing that actually helps realize the benefits of LLM models we have today. My last NixOS install was completely managed by Claude Code and it worked very well. This was the result of my latest frustrations:
Though I know the statement it made isn't "true". I've had much better luck pursuing other implementation paths with CC in the same space. I could have prompted around this and should have reset the context much earlier but I was drunk "coding" at that point and drove it into a corner.
I haven’t used Codex a lot, but GPT-5 is just a bit smarter in agent mode than Claude 4.5. The most challenging thing I’ve used it for is for code review and GPT-5 somewhat regularly found intricate bugs that Claude missed. However, Claude seemed to be better at following directions exactly vs GPT-5 which requires a lot more precision.
Have tried all and continue to eval regularly. I spend up to 14 hours a day. Currently recovering from a herniated disk because I spent 6 weeks sitting at a dining room table, 14 hours a day, leaning forward. Don't do that. lol. So my coverage is pretty good. I'm using GPT5-codex-high for 99% of my work. Also I have a team of 40 folks, about a third of which are software engineers and the other third are cybersecurity analysts, so I get feedback from them too and we go deep on our engineering calls re the latest learnings and capabilities.
This was my experience too until a couple weeks ago, when Codex suddenly got dumbed down.
Initially, I had great success with codex medium - I could refactor with confidence, code generally ran on the first or second try, etc.
Then when that suddenly dumbed down to Claude Sonnet 3.5 quality I moved to GPT5 High to get back what had been lost. That was okay for a few days. Now GPT5 High has dropped to Claude Sonnet 3.5 quality.
IMO Sonnet 4.5 is great but it just isn't as comprehensive of a thinker. I love Anthropic and primarily use CC day to day, but for any tricky problems or "high stakes, this must not have bugs" issues, I turn to Codex. I do find that if you let Codex run on its own too long it will produce comparably sloppy or lacking-in-vision type issues that people criticize Sonnet for, however.
I'm like 80% sure Sonnet 4.5 is just rebranded Opus.
Sonnet 4 was a coding companion, I could see what it was doing and it did what I asked.
Sonnet 4.5 is like Opus: it generates massive amounts of "helper scripts" and "bootstrap scripts" and all kinds of useless markdown documentation files even for the tiniest PoC scripts.
Costs are 6x lower and it's way faster and good at test writing and tool calling. It can sometimes be a bit messy though, so use Gemini or Claude or Codex for the hard problems....
Similar feeling. It seems to be good at certain things, but if something doesn't work it wants to do things simply, and in turn it becomes something you didn't ask for, at times the opposite of what you wanted. On the other hand, with Codex there are certain times you feel the AGI, but that's like 2 out of 10 sessions. This may primarily be due to how complete the prompt is and how well you define the problem.
Totally agree. I was just thinking that I wouldn't want this feature for Claude Code, but for Codex right now it would be great! I can simply let tasks run in Codex and I know it's going to eventually do what I want. Whereas with Claude Code I feel like I have to watch it like a hawk and interrupt it when it goes off the rails.
My experience is similar, but for me, Claude Code is still better when designing or developing a frontend page from scratch. I have seen that Codex follows instructions a bit too literally, and the result can feel a little cold.
CC on the other hand feels more creative and has mostly given better UI.
Of course, once the page is ready, I switch to Codex to build further.
Does no one use Block's Goose CLI anymore? I went to a hackathon in SF at the beginning of the year and it seemed like 90% of the groups used Goose to do something in their agent project. I get that the CLI agent scene has exploded since then; I just wonder what is so much better in the competition?
As we all know here, if the title of this post was about Codex on the web, the top comment would have been about using Claude instead.
YMMV, but this definitely doesn't track with everything I've been seeing and hearing, which is that Codex is inferior to Claude on almost every measure.
fwiw I'm happy to see this - been trying to tackle a hairy problem (rendering bugs) and both models fail, but:
1. Codex takes longer to fail and with less helpful feedback, but tends to at least not produce as many compiler errors
2. Claude fails faster and with more interesting back-and-forth, though tends to fail a bit harder
Neither of them are fixing the problems I want them to fix, so I prefer the faster iteration and back-and-forth so I can guide it better
So it's a bit surprising to me when so many people are picking a "clear winner" that I prefer less atm
They put all of their eggs in the coding basket, with the rest of their mission couched as "effective altruism," or "safetyism," or "solving alignment," (all terms they are more loudly attempting to distance themselves from[0], because it's venture kryptonite).
Meanwhile, all OpenAI had to do was point their training cannon at it for a run, and suddenly Anthropic is irrelevant. OpenAI's focus as a consumer company (and growing as a tool company) is a safe, venture-backable bet.
Frontier AI doesn't feel like a zero-sum game, but for now, if you're betting on AI at all, you can really only bet on OpenAI, like Tesla being a proxy for the entire EV industry.
For non-vibe coding purposes I've found that my $200 Claude (Claude Code) account regularly outperformed my $200 ChatGPT (Codex) account. This was after 2 months of heavily testing both mostly in Terminal TUI/CLI form and most recently with the latest VSCode/Cursor incarnations.
Even with the additional Sora usage and other bells & whistles that ChatGPT @ $200 provides, Claude provides more value for my use cases.
Claude Code is just a lot more comfortable being in your workflow and being a companion or going full 'agent(s)' and running for 30 minutes on one ticket. It's also a lot happier playing with Agents from other APIs.
There's nothing wrong with Anthropic wanting to completely own that segment and not have aspirations of world domination like OpenAI. I don't see how that's a negative.
If anything, the more ChatGPT becomes a 'everything app' the less likely I am to hold on to my $20 account after cancelling the $200 account. I'm finding the more it knows about me the more creeped out and "I didn't ask for this" I become.
It's really solid. It's effectively a web (and native mobile) UI over Claude Code CLI, more specifically "claude --dangerously-skip-permissions".
Anthropic have recognized that Claude Code where you don't have to approve every step is massively more productive and interesting than the default, so it's worth investing a lot of resources in sandboxing.
It’s interesting because I’ve slowly arrived at the opposite conclusion: for much of my practical day to day work, using CC with “allow edits” turned OFF results in a much better end product. I can correct it inline, I pseudo-review the code as it’s produced, etc etc. Codex is better for “fire and forget” features for sure. But Claude remains excellent at grokking intent for problems where you aren’t quite sure what you want to build yet or are highly opinionated. Mostly due to the fact it’s faster and the iteration loop is faster.
That approach should work well for projects where you are directly working on the code in tandem with Claude, but a lot of my own uses are much more research oriented. I like sending Claude Code off on a mission to figure out how to do something.
It slows it down far too much for me. What I've found after switching to --dangerously-skip-permissions is that while the intermediate work product is often total junk, when I then start writing a message to tell Claude to switch approach, a large proportion of the time it has figured that out by itself before I'm finished writing the message.
So increasingly I let it run, and then review when it stops, and then I give it a proper review, and let it run until it stops again. It wastes far less of my time, and finishes new code much faster. At least for the things I've made it do.
> it's worth investing a lot of resources in sandboxing.
I tend to agree. There’s an opportunity to make it easy to have Claude be able to test out workflows/software within Debian, RPM, Windows, etc… container and VM sandboxes. This could be helpful for users that want to release code on multiple platforms and help their own training and testing, which they seem to be heavily invested in given all the “How Am I doing?” prompts we’re getting.
Do you have a practical sense of the level of mischief possible in the sandbox? It seems like a game of regexp whack-a-mole to me, which seems like a predictable recipe for decades of security problems. Allow- and deny-lists for files and domains seem about as secure as backslash-escaping user input before passing it to the shell.
If you configure it with the "no network access" environment there's nothing bad that can happen. Worst is you end up wasting a bunch of CPU cycles in a container somewhere in Anthropic's infrastructure.
great points @simonw - I, incredibly, haven't ever tried --dangerously-skip-permissions yet for any "real" projects. I generally find that it stops itself for good reason.
The most interesting parts of this to me are somewhat buried:
- Claude Code has been added to iOS
- Claude Code on the Web allows for seamless switching to Claude Code CLI
- They have open sourced an OS-native sandboxing system which limits file system and network access _without_ needing containers
However, I find the emphasis on limiting the outbound network access somewhat puzzling because the allowlists invariably include domains like gist.github.com and dozens of others which act effectively as public CMS’es and would still permit exfiltration with just a bit of extra effort.
I used `sandbox-exec` previously before moving to a better solution (done right, sandboxing on macOS can be more powerful than Linux imo). The way `sandbox-exec` works is that all child processes inherit the same restrictions. For example, if you run `sandbox-exec $rules claude --dangerously-skip-permissions`, any commands executed by Claude through a shell will also be bound by those same rules. Since the sandbox settings are applied globally, you currently can’t grant or deny granular read/write permissions to specific tools.
Using a proxy through the `HTTP_PROXY` or `HTTPS_PROXY` environment variables has its own issues. It relies on the application respecting those variables; if it doesn't, the connection will simply fail. Sure, in this case, since all other network connection requests are dropped, you are somewhat protected, but an application that doesn't respect them will simply not work.
You can also have some fun with `DYLD_INSERT_LIBRARIES`, but that often requires creating shims to make it work with codesigned binaries.
Exfiltration is always going to be possible, the question is, is it difficult enough for an attacker to succeed against the defenses I've put in place. The problem is, I really want to share, and help protect others, but if I write it up somewhere anybody can read, it's gonna end up in the training data.
I feel like these background agents still aren't doing what I want from a developer experience perspective. Running in an inaccessible environment that pushes random things to branches that I then have to checkout locally doesn't feel great.
AI coding should be tightly in the inner dev loop! PRs are a bad way to review and iterate on code. They are a last line of defense, not the primary way to develop.
Give me an isolated environment that is one click hooked up to Cursor/VSCode Remote SSH. It should be the default. I can't think of a single time that Claude or any other AI tool nailed the request on the first try (other than trivial things). I always need to touch it up or at least navigate around and validate it in my IDE.
Right, that is closer to what I was hoping this announcement would be. I really just want a (mobile/web) companion to whatever CLI environment I have Claude Code running in. That would perfectly fill in the exact niche missing in my local dev server VM setup I remote into with any combination of SSH, VS Code Remote, or via Web (VS Code Tunnel from vscode.dev and a ttyd remote CLI session in the browser).
It would be great to be able to check in on Claude on a walk or something to make sure it hasn't gone off the rails or send it a quick "LGTM" to keep moving down a large PLAN.md file without being tethered to a keyboard and monitor. I can SSH from my phone but the CLI ergonomics are ... not great with an on screen keyboard, when all it really needs is a simple threaded chat UI.
I've seen a couple of GitHub projects and "Happy Coder" on a Show HN, which I haven't got around to setting up yet, that seem in the ballpark of what I want, but a first party integration would always be cool.
I tried Happy Coder for a bit. It seemed exactly what I was missing but about 1/2 the time session notifications weren't coming through and the developers of the tool seem busy pushing it off in other directions rather than in making the core functionality bullet-proof so I gave up on it. Unfortunate. Hopefully something else pops up or Anthropic bakes it into their own tooling.
I agree and I also think the problem is deeper than that. It's about not being able to do most code testing and debugging remotely. You can't really test anything remotely... It's in an ephemeral container without any of your data, just your repo. You can't have the model do npm run dev and browse to see the webpage, click around, etc. You can't compile or run anything heavy, you can't persist data across sessions/days, etc.
I like the idea of background agents running in the cloud, but it has to be a more persistent environment. It also has to run with a GUI so it can develop web applications or run the programs we are developing, and run them properly with the GUI, which requires clicking around, typing things, etc. Computer use is what we need. But that would probably be too expensive to serve to the masses with the current models.
Definitely sounds cool. But the problem hasn't even been solved locally yet. Distributed microservices, 3rd party dependencies, async callbacks, reasonable test data, unsatisfiable validations, etc. Every company has their own hacked together local testing thing that mostly doesn't work.
That said, maybe this is the turning point where these companies work toward solving it in earnest, since it's a key differentiator of their larger PLATFORM and not just a cost. Heck, if they get something like that working well, I'd pay for it even without the AI!
Edit: that could end up being really slick too if it was able to learn from your teammates and offer guidance. Like when you're checking some e2e UI flows but you need a test item that has some specific detail, it maybe saw how your teammate changed the value or which item they used or created, and can copy it for you. "Hey it looks like you're trying to test this flow. Here's how Chen did it. Want me to guide you through that?" They can't really do that with just CLI, so the web interface could really be a game changer if they take full advantage of it.
What you're describing feels like the next major evolution and is likely years away (and exciting!).
I'm mainly aiming for a good experience with what we have today. Welding an AI agent onto my IDE turned out to be great. The next incremental step feels like being able to parallelize that. I want four concurrent IDEs with AI welded onto it.
idk, we’ve (humans) gotten this far with them. I don’t think they are the right tool for AI generated code and coding agents though, and that these circles are being forced to fit into those squares. imho it’s time for an AI-native git or something.
PRs work well for what they are. Ship off some changes you're strongly confident about and have another human who has a lot of context read through it and double check you. It's for when you think you've finished your inner loop.
AI is more akin to pair programming with another person sitting next to you. I don't want to ship a PR or even a branch off to someone sitting next to me. I want to discuss and type together in real time.
Agree, each agent creating a PR and then coordinating merges is a pain.
I’d like
- agent to consolidate simple non-conflicting PRs
- faster previews and CI tests (Render currently)
- detect and suggest solutions for merge conflicts
Codex web doesn’t update the PR, which is also something to change, maybe a setting, but for web code agents (?) I’d like the PR to stay open once opened
Also PRs need an overhaul in general. I create lots of speculative agents, if I like the solution I merge, leading to lots of PRs
Thank you. Every time these agentic cloud tools come out, I wonder to myself whether I'm not using them right or misunderstand vs, say, local Cursor development paradigm.
Plus they generate so much noise with all the extra commits and comments that go to everyone in slack and email rather than just me.
I just run the agent directly on separate testing/dev servers via remote-ssh in VS Code to have an IDE to sanity check stuff. Just far simpler than local dev and other nonsense.
this is a great point. The inner / outer loop is big. I think AI pushing PRs is kind of like pushing drafts to the public in social media. I don't want folks seeing PRs and such until I feel good about it. It adds a lot of noise, and increases build costs unless your CI/CD treats them differently which I don't know anyone doing.
Yes, but my point is that often I don't want to. Sometimes there are changes I can make in seconds. I don't want to wait 15+ seconds for an AI that might do it wrong or do too much.
Also it isn't always about editing. It is about seeing the surrounding code, navigating around, and ensuring the AI did the right thing in all of the right places.
Huge waste of time. You are being sold a bill of goods whose only purpose is to make you a dumb dev. Like woah, an LLM can use CDP!! Who cares. Can't wait till people start waking up to this grift. These things are making people so dumb and a few richer, that's it.
Hey, Kanjun from Imbue here! This is exactly why we built Sculptor (https://imbue.com/sculptor), a desktop UI for Claude Code.
Each agent has its own isolated container. With Pairing Mode, you can sync the agent's code and git state directly into your local Cursor/any IDE so you can instantly validate its work. The sync is bidirectional so your local changes flow back to the agent in realtime.
Happy to answer any questions - I think you'll really like the tight feedback loop :)
> We were heavy users of Claude Code ($70K+ spend per year) and have almost completely switched to codex CLI
Seeing comments like this all over the place. I switched to CC from Cursor in June / July because I saw the same types of comments. I switched from VSCode + Copilot about 8 months before that for the same reason. I remember being skeptical that this sort of thing was guerilla marketing, but CC was in fact better than Cursor. Guess I'll try Codex, and I guess that it's good that there are multiple competing products making big strides.
Never would have imagined myself ditching IDEs and workflows 3x in a few months. A little exhausting
I think it’s a lot less exhausting now that the IDE part is mostly decoupled. I can’t imagine cursor continuing to compete when really all they’re doing is selling tokens either a markup, and hence crushing your context on every call. Sorry if that sounds negative but it’s true.
I use CC and Codex somewhat interchangeably, but I have to agree with the comments. Codex is a complete monster, and there really isn't any competition right now.
OpenAI seems to limit how "hard" your gpt-5-codex can think depending on your subscription plan; whereas Anthropic/Claude only limits how much use you get. I evaluate Codex every month or so with a problem suited to it, but rarely gets merged over a version produced by Charlie (which yes is $500/mo, but rarely causes problems) or something Claude did in a managed or unmanaged session. ymmv
No relations to them, but I've started using Happy[0]'s iOS app to start and continue Claude Code sessions on my iPhone. It allows me to run sessions on a custom environment, like a machine with a GPU to train models
This seems to be the only solution still if using bedrock or direct API access instead of Pro / Max plan, the Claude Code for Web doesn't seem to let you use it that way.
You can log in to your CC instance however you like, including via Pro/Max. Happy just wraps it and provides remote access with a much better UI than using a phone-based terminal app.
I was just working on something similar for OpenCode - pushing it now in case it's useful for someone[0].
It can run in a front-end only mode (I'll put up a hosted version soon), and then you need to specify your OpenCode API server and it'll connect to it. Alternatively, it can spin up the API server itself and proxy it, and then you just need to expose (securely) the server to the internet.
The UI is responsive and my main idea was that I can easily continue directing the AI from my phone, but it's also of course possible to just spin up new sessions. So often I have an idea while I'm away from my keyboard, and being able to just say "create an X" and let it do its thing while I'm on the go is quite exciting.
It doesn't spin up a special sandbox environment or anything like that, but you're really free to run it inside whatever sandboxing solution you want. And unlike Claude Code, you're of course free to choose whatever model you want.
I've been using Happy Coder[0] for some time now on web and mobile. I run it in `--yolo` mode on an isolated VM across multiple projects.
With Happy, I managed to turn one of these Claude Code instances into a replacement for Claude that has all the MCP goodness I could ever want and more.
If we ship more for less because the new agent doesn't tap out, that's not a honeymoon, it's an upgrade.
https://github.com/just-every/code
Fixed all of these in a heartbeat. This has been a game changer
This is a really neat way of describing the phenomenon I've been experiencing and trying to articulate, cheers!
[1] https://news.ycombinator.com/item?id=45491842
It's very odd because I was hoping they were very on par.
It also often fails to escalate a command, it'll even be like, oh well I'm in a sandbox so I guess I can't do this, and will just not do it and try to find a workaround instead of escalating permission to do the command.
I use Claude Code almost daily, and it continues to blow me away. I don't use Codex often, because every time I have used it, the output has been next to worthless and generally invalid. Even when it does eventually get me what I want, it takes much more prompting to reach a functioning result. Claude Code, on the other hand, gives me good code from the initial prompt. I'm continually surprised at exactly how little prompting it requires. I have given it challenges with very vague prompts where it really exceeded my expectations.
For me, though, it's not remotely close. Codex has fucked up 95% of the 50-or-so tasks I asked it to do, while Claude Code fucks up only maybe 60%.
I'm big on asking LLMs to do the first major step of something, and then coming back later, and if it looks like it kinda sucks, just Ctrl-C and git revert that container/folder. And I also explicitly set up "here are the commands you need to run to self-check your work" every time. (Which Codex somewhat weirdly sometimes ignores with the explicit (false) claim that it skipped that step because it wasn't requested... hmm.)
So, those kinds of workflow preferences might be a factor, but I haven't seen Codex ever be good yet, and I regret the time I invested trying it too early.
Claude, especially 4.5 Sonnet, is a lot nicer to interact with, so it may be a better choice in cases where you are co-working with the agent. Its output is nicer, and it "improvises" really well even if you give it only vague prompts. That's valuable for interactive use.
But for delegating complete tasks, Codex is far better. The benchmarks indicate that, as do most practitioners I talk to (and it is indeed my own experience).
In my own work, I use Codex for complete end-to-end tasks, and Claude Sonnet for interactive sessions. They're actually quite different.
Sonnet 4.5 tends to randomly hallucinate odd/inappropriate decisions and goes off making stupid changes that have to be patched up manually.
Kinda sick of Codex asking for approval to run tests for each test instance
https://zed.dev/blog/codex-is-live-in-zed
Also, CC's tool usage is so much better! Many, many times I've seen Codex write a Python script to edit a file, which seems to bypass the diff view, so you don't really know what's going on.
It's as if there are two vendors saying they can give you incredible superpowers for an affordable price, and only one of them actually delivers the full package. The other vendor's powers only work on Tuesdays, and when you're lucky. In that situation, in an environment as competitive as things currently stand, and given the trajectory we're on, Claude is an absolute non-starter for me. Without question.
Claude feels like a better fit for an experienced engineer. He's a positive, eager little fellow.
On the other hand, the last time I tried GPT-5 from Cursor, it was so disappointing. It kept getting confused while we were iterating on a plan, and I had to point out multiple times that it was thinking about the problem in the same (wrong) way. After a while I gave up, opened a new chat, and gave it my own summary of the conversation (with the wrong parts removed), and then it worked fine. Maybe my initial prompt was vague, but it continually seemed to forget course corrections in that chat.
I mostly use them to save me from typing rather than asking them to design things. Occasionally we have a more open-ended discussion, but those have great variance. They seem to do better with such discussions in the web chat than within the coding tool (I've bounced maths/implementation ideas off them while writing shaders on a personal project).
When making boilerplate-ish changes in areas of the product I'm not familiar with (it's a large codebase), GPT-5-high is a monster.
However, Anthropic severely restricted Opus use for Max plan users 10 days or so ago (12-fold, from 40h/week down to 5h/week) [1].
Sonnet is a vastly inferior model for my use cases (but still frequently writes better Rust code than Codex). So now I use Codex for planning and Sonnet for writing the code. However, I usually need about 3-5 loops of Codex reviewing, Sonnet fixing, rinse & repeat.
Before, I could one-shot with Opus, review it myself directly, and do one polish run following my review (also via Opus). That was possible from June through mid-October, but no more.
[1] https://github.com/anthropics/claude-code/issues/8449
They all like to push synthetic benchmarks for marketing, but to me there's zero doubt that both Anthropic and OpenAI are well aware that they're not representative of logical thinking and creativity.
I set up an MCP tool to use GPT-5 with high reasoning from Claude Code (tools with "personas" such as architect, security reviewer, etc.), and I feel that it SIGNIFICANTLY amplifies the performance of Claude alone. I don't see other people using LLMs as tools in these environments, and it's making me wonder if I'm either missing something or somehow ahead of the curve.
Basically, instead of "do X (with details)", I say "ask the architect tool how you should implement X", and it gets into a back-and-forth that's more productive because it forces some "introspection" on the plan.
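For anyone who wants to try the same pattern, here's a rough sketch of what such a "persona" tool can look like, assuming the MCP Python SDK (FastMCP) and OpenAI's Responses API; the tool name, system prompt, and model settings are illustrative, not necessarily what the commenter above actually runs:

    # architect_server.py - minimal sketch of an "architect" persona as an MCP tool.
    # Assumes the MCP Python SDK and the OpenAI client are installed and
    # OPENAI_API_KEY is set in the environment.
    from mcp.server.fastmcp import FastMCP
    from openai import OpenAI

    mcp = FastMCP("architect")
    client = OpenAI()

    @mcp.tool()
    def architect(question: str) -> str:
        """Ask a high-reasoning model for an implementation plan or design review."""
        response = client.responses.create(
            model="gpt-5",                 # assumed model name; swap for whatever you use
            reasoning={"effort": "high"},  # the "high reasoning" part
            input=(
                "You are a pragmatic software architect. "
                "Propose a concrete, step-by-step implementation plan.\n\n" + question
            ),
        )
        return response.output_text

    if __name__ == "__main__":
        mcp.run()  # stdio transport by default, which Claude Code can attach to

Once the server is registered in Claude Code's MCP config, prompts like "ask the architect tool how you should implement X" route through the bigger model before Claude writes any code.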
Sourcegraph Amp (https://sourcegraph.com/amp) has had this exact feature built in for quite a while: "ask the oracle" triggered an O1 Pro sub-agent (now, I believe, GPT-5 High), and searching can be delegated to cheaper, faster, longer-context sub-agents based on Gemini 2.5 Flash.
Claude Code has only been generally available since May last year (a year and a half ago)... I'm surprised by the process you're implying: within a year and a half, you both spent $70K on Claude Code and learned enough about it and its competition to switch away from it? I don't think I'd be able to do that due diligence even if LLM evaluation were my full-time job. Let alone the fact that the capabilities of each provider are changing dramatically every few weeks.
Fails to escalate permissions, gets derailed, loves changing too many things everywhere.
GPT-5 is good, but Codex is not.
* Is really only magic on Linux or WSL. Mediocre on Windows
* Is quite mediocre at UI code but exceptional at backend, engineering, ops, etc. (I use Claude to spruce up everything user facing -- Codex _can_ mirror designs already in place fairly well).
* Exceptional at certain languages, OK at others.
* GPT-5 and GPT-5-Codex are not the same. Both are models used by the Codex CLI and the GPT-5-Codex model is recent and fantastically good.
* Codex CLI is not "conversational" in the way that Claude is. You kind of interact with it differently.
I often wonder about the impact of different prompting styles. I think the WOW moment for me is that I am no longer returning to code to find tangled messes, duplicate siloed versions of the same solution (in a different project in the same codebase), or strangely novice-style coding and error handling.
As a developer of 20+ years, using Codex running the GPT-5-Codex model has felt like working with a peer or near-peer for the first time ever. I've been able to move beyond smaller efforts and make quite a lot of progress that didn't have to be undone or redone. I've used it for a solid month, making phenomenal progress and offloading work as if I had another developer.
Honestly, my biggest concern is that OpenAI is teasing this capable model and then pulls the rug in a month with an "update".
As for the topic at hand, I think Claude Code has without a doubt the best "harness" and interface. It's faster, painless, and has a very clean and readable way of laying out findings when troubleshooting. If there were a cheap and usable version of Opus... perhaps that would keep Claude Code on the cutting edge.
I've been seeing more of this lately despite initially excellent results. Not sure what's going on, but the value is certainly dropping for me. I'll have to check out Codex. CLI integration is critical for me at this point; it's the only thing that actually helps realize the benefits of the LLM models we have today. My last NixOS install was completely managed by Claude Code, and it worked very well. This was the result of my latest frustrations:
https://i.imgur.com/C4nykhA.png
Though I know the statement it made isn't "true". I've had much better luck pursuing other implementation paths with CC in the same space. I could have prompted around this and should have reset the context much earlier but I was drunk "coding" at that point and drove it into a corner.
Initially, I had great success with codex medium- I could refactor with confidence, code generally ran on the first or second try, etc.
Then, when that suddenly dumbed down to Claude Sonnet 3.5 quality, I moved to GPT-5 High to get back what had been lost. That was okay for a few days. Now GPT-5 High has dropped to Claude Sonnet 3.5 quality too.
There's nothing left to fallback to.
Sonnet 4 was a coding companion, I could see what it was doing and it did what I asked.
Sonnet 4.5 is like Opus: it generates massive amounts of "helper scripts" and "bootstrap scripts" and all kinds of useless markdown documentation files, even for the tiniest PoC scripts.
Costs are 6x cheaper, and it's way faster and good at test writing and tool calling. It can sometimes be a bit messy though, so use Gemini, Claude, or Codex for the hard problems...
CC on the other hand feels more creative and has mostly given better UI.
Of course, once the page is ready, I switch to Codex to build further.
YMMV, but this definitely doesn't track with everything I've been seeing and hearing, which is that Codex is inferior to Claude on almost every measure.
1. Codex takes longer to fail and gives less helpful feedback, but at least it tends not to produce as many compiler errors.
2. Claude fails faster and with more interesting back-and-forth, though it tends to fail a bit harder.
Neither of them are fixing the problems I want them to fix, so I prefer the faster iteration and back-and-forth so I can guide it better
So it's a bit surprising to me when so many people are picking a "clear winner" that I prefer less at the moment.
I thought Claude Code was still better at tool calling and things like that.
They put all of their eggs in the coding basket, with the rest of their mission couched as "effective altruism," or "safetyism," or "solving alignment," (all terms they are more loudly attempting to distance themselves from[0], because it's venture kryptonite).
Meanwhile, all OpenAI had to do was point their training cannon at it for a run, and suddenly Anthropic is irrelevant. OpenAI's focus as a consumer company (and growing as a tool company) is a safe, venture-backable bet.
Frontier AI doesn't feel like a zero-sum game, but for now, if you're betting on AI at all, you can really only bet on OpenAI, like Tesla being a proxy for the entire EV industry.
[0] https://forum.effectivealtruism.org/posts/53Gc35vDLK2u5nBxP/...
Even with the additional Sora usage and other bells & whistles that ChatGPT @ $200 provides, Claude provides more value for my use cases.
Claude Code is just a lot more comfortable being in your workflow and being a companion or going full 'agent(s)' and running for 30 minutes on one ticket. It's also a lot happier playing with Agents from other APIs.
There's nothing wrong with Anthropic wanting to completely own that segment and not have aspirations of world domination like OpenAI. I don't see how that's a negative.
If anything, the more ChatGPT becomes a 'everything app' the less likely I am to hold on to my $20 account after cancelling the $200 account. I'm finding the more it knows about me the more creeped out and "I didn't ask for this" I become.
It's really solid. It's effectively a web (and native mobile) UI over Claude Code CLI, more specifically "claude --dangerously-skip-permissions".
Anthropic have recognized that Claude Code where you don't have to approve every step is massively more productive and interesting than the default, so it's worth investing a lot of resources in sandboxing.
Here's an example from this morning, getting CUDA working on a NVIDIA Spark: https://simonwillison.net/2025/Oct/20/deepseek-ocr-claude-co...
I have a few more in https://github.com/simonw/research
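For anyone who wants a rough local approximation of the "skip permissions, but contained" setup described above without the hosted sandbox, one option (just a sketch; the image name and mount path are placeholders, and Docker is only one of several ways to do this) is to run the CLI inside a throwaway container:

    import subprocess

    # Rough sketch: run the agent with permission prompts disabled, but only inside
    # a disposable container where the blast radius is limited to one mounted
    # project directory. "agent-image" is a placeholder image with the CLI installed.
    # Note the agent still needs outbound access to its own API, so in practice you
    # would pair this with a network allow-list or proxy rather than cutting the
    # network off entirely.
    subprocess.run([
        "docker", "run", "--rm", "-it",
        "--cap-drop", "ALL",                  # drop Linux capabilities inside the container
        "-v", "/path/to/project:/work",       # only this directory is writable from inside
        "-w", "/work",
        "agent-image",
        "claude", "--dangerously-skip-permissions",
    ], check=False)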
So increasingly I let it run, and then review when it stops, and then I give it a proper review, and let it run until it stops again. It wastes far less of my time, and finishes new code much faster. At least for the things I've made it do.
I tend to agree. There’s an opportunity to make it easy to have Claude be able to test out workflows/software within Debian, RPM, Windows, etc… container and VM sandboxes. This could be helpful for users that want to release code on multiple platforms and help their own training and testing, which they seem to be heavily invested in given all the “How Am I doing?” prompts we’re getting.
Their "restricted network access" setting looks questionable to me - it allow-lists a LOT of stuff: https://docs.claude.com/en/docs/claude-code/claude-code-on-t...
If you configure your own allow-list you can restrict to just domains that you trust - which is enforced by a separate HTTP/HTTPS proxy, described here: https://docs.claude.com/en/docs/claude-code/claude-code-on-t...
- Claude Code has been added to iOS
- Claude Code on the Web allows for seamless switching to Claude Code CLI
- They have open sourced an OS-native sandboxing system which limits file system and network access _without_ needing containers
However, I find the emphasis on limiting the outbound network access somewhat puzzling because the allowlists invariably include domains like gist.github.com and dozens of others which act effectively as public CMS’es and would still permit exfiltration with just a bit of extra effort.
Using a proxy through the `HTTP_PROXY` or `HTTPS_PROXY` environment variables has its own issues. It relies on the application respecting those variables; if it doesn't, the connection will simply fail. Sure, in this case, since all other network connection requests are dropped, you are somewhat protected, but an application that doesn't respect them just won't work.
You can also have some fun with `DYLD_INSERT_LIBRARIES`, but that often requires creating shims to make it work with codesigned binaries
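To make the env-var approach concrete, here's a minimal sketch; the proxy address and the CLI being launched are assumptions, and this only constrains tools that actually honour `HTTP_PROXY`/`HTTPS_PROXY`, so everything else still has to be dropped by a separate firewall or sandbox rule:

    import os
    import subprocess

    # Point a child process at a local allow-listing proxy purely via environment
    # variables. The proxy at 127.0.0.1:8080 is assumed to already enforce the
    # domain allow-list; this script just routes traffic to it.
    env = dict(os.environ)
    env.update({
        "HTTP_PROXY": "http://127.0.0.1:8080",
        "HTTPS_PROXY": "http://127.0.0.1:8080",
        "NO_PROXY": "localhost,127.0.0.1",  # keep local traffic direct
    })

    # "your-agent-cli" is a placeholder for whatever tool you are sandboxing.
    subprocess.run(["your-agent-cli"], env=env, check=False)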
AI coding should be tightly in the inner dev loop! PRs are a bad way to review and iterate on code. They are a last line of defense, not the primary way to develop.
Give me an isolated environment that hooks up to Cursor/VSCode Remote SSH in one click. It should be the default. I can't think of a single time that Claude or any other AI tool nailed the request on the first try (other than trivial things). I always need to touch it up, or at least navigate around and validate it in my IDE.
It would be great to be able to check in on Claude on a walk or something to make sure it hasn't gone off the rails, or send it a quick "LGTM" to keep it moving down a large PLAN.md file without being tethered to a keyboard and monitor. I can SSH from my phone, but the CLI ergonomics are ... not great with an on-screen keyboard, when all it really needs is a simple threaded chat UI.
I've seen a couple of GitHub projects and a "Happy Coder" Show HN (which I haven't got around to setting up yet) that seem in the ballpark of what I want, but a first-party integration would always be cool.
I like the idea of background agents running in the cloud, but it has to be a more persistent environment. It also has to run with a GUI so it can develop web applications or run the programs we are developing, and run them properly: clicking around the interface, typing things, etc. Computer use is what we need. But that would probably be too expensive to serve to the masses with the current models.
That said, maybe this is the turning point where these companies work toward solving it in earnest, since it's a key differentiator of their larger PLATFORM and not just a cost. Heck, if they get something like that working well, I'd pay for it even without the AI!
Edit: that could end up being really slick too if it was able to learn from your teammates and offer guidance. Like when you're checking some e2e UI flows but you need a test item that has some specific detail, it maybe saw how your teammate changed the value or which item they used or created, and can copy it for you. "Hey it looks like you're trying to test this flow. Here's how Chen did it. Want me to guide you through that?" They can't really do that with just CLI, so the web interface could really be a game changer if they take full advantage of it.
I'm mainly aiming for a good experience with what we have today. Welding an AI agent onto my IDE turned out to be great. The next incremental step feels like being able to parallelize that. I want four concurrent IDEs with AI welded onto it.
idk, we’ve (humans) gotten this far with them. I don’t think they are the right tool for AI generated code and coding agents though, and that these circles are being forced to fit into those squares. imho it’s time for an AI-native git or something.
AI is more akin to pair programming with another person sitting next to you. I don't want to ship a PR or even a branch off to someone sitting next to me. I want to discuss and type together in real time.
I’d like
- agent to consolidate simple non-conflicting PRs
- faster previews and CI tests (Render currently)
- detect and suggest solutions for merge conflicts
Codex web doesn't update the PR, which is also something to change (maybe a setting), but for web code agents I'd like the PR, once opened, to stay open.
Also, PRs need an overhaul in general. I create lots of speculative agents; if I like the solution I merge it, which leads to lots of PRs.
Plus, they generate so much noise with all the extra commits and comments that go to everyone in Slack and email rather than just me.
[1] https://ona.com/
I want to run a prompt that operates in an isolated environment that is open in my IDE where I can iterate with the AI. I think maybe it can do this?
Also it isn't always about editing. It is about seeing the surrounding code, navigating around, and ensuring the AI did the right thing in all of the right places.
Each agent has its own isolated container. With Pairing Mode, you can sync the agent's code and git state directly into your local Cursor/any IDE so you can instantly validate its work. The sync is bidirectional so your local changes flow back to the agent in realtime.
Happy to answer any questions - I think you'll really like the tight feedback loop :)
Seeing comments like this all over the place. I switched to CC from Cursor in June/July because I saw the same types of comments. I switched from VSCode + Copilot about 8 months before that for the same reason. I remember suspecting that this sort of thing was guerrilla marketing, but CC was in fact better than Cursor. Guess I'll try Codex, and I guess it's good that there are multiple competing products making big strides.
Never would have imagined myself ditching IDEs and workflows 3x in a few months. A little exhausting
I use CC and Codex somewhat interchangeably, but I have to agree with the comments. Codex is a complete monster, and there really isn't any competition right now.
[0] https://github.com/slopus/happy/
It can run in a front-end only mode (I'll put up a hosted version soon), and then you need to specify your OpenCode API server and it'll connect to it. Alternatively, it can spin up the API server itself and proxy it, and then you just need to expose (securely) the server to the internet.
The UI is responsive, and my main idea was that I can easily continue directing the AI from my phone, but it's also of course possible to just spin up new sessions. So often I have an idea while I'm away from my keyboard, and being able to just say "create an X" and let it do its thing while I'm on the go is quite exciting.
It doesn't spin up a special sandbox environment or anything like that, but you're really free to run it inside whatever sandboxing solution you want. And unlike Claude Code, you're of course free to choose whatever model you want.
[0] https://github.com/bjesus/opencode-web
With Happy, I managed to turn one of these Claude Code instances into a replacement for Claude that has all the MCP goodness I could ever want and more.
[0]: https://happy.engineering/