We were heavy users of Claude Code ($70K+ spend per year) and have almost completely switched to codex CLI. I'm doing massive lifts with it on software that would never before have been feasible for me personally, or any team I've ever run. I'll use Claude Code maybe once every two weeks as a second set of eyes to inspect code and document a bug, with mixed success. But my experience has been that initially Claude Code was amazing and a "just take my frikkin money" product. Then Codex overtook CC and is much better at longer runs on hard problems. I've seen Claude Code literally just give up on a hard problem and tell me to buy something off the shelf. Whereas Codex's ability to profoundly increase the capabilities of a software org is a secret that's slowly getting out.
I don't have any relationship with any AI company, and honestly I was rooting for Anthropic, but Codex CLI is just way way better.
Also Codex CLI is cheaper than Claude Code.
I think Anthropic are going to have to somehow leapfrog OpenAI to regain the position they were in around June of this year. But right now they're being handed their hat.
Feels like with every announcement there’s the same comment: “this LLM tool I’m using now is the real deal, the thing I was using previously and spending stupid amounts of money on looked good but failed at XYZ, this new thing is where it’s at”. Rinse and repeat.
Which means it wasn’t true any of the previous times, so why would it be true this time? It feels like an endless loop of the “friendship ended” meme with AI companies.
It’s much more likely commenters are still in the honeymoon hype phase and (again) haven’t found the problems because they’re hyper focused on what the new thing is good at that the previous one wasn’t, ignoring the other flaws. I see that a lot with human relationships as well, where people latch on to new partners because they obviously don’t have the big problem that was a strain on the previous relationship. But eventually something else arises. Rinse and repeat.
People just aren't used to how LLMs and their tools are developed
Depending on the time of the year you can expect fresh updates from any given company on how their new models and tools perform and they'll generally blow the competition out of the water.
The trick is to either realize that your current tools will just become magically better in a few months OR lean in and switch companies as their tools and models update.
GPT-5 is not the final deal, but it's incredibly good as is at coding.
Anecdotal, but it's something else entirely in terms of capabilities. Ignore it at your own peril; I think it will profoundly change software development.
In my experience this mostly happened with Anthropic models, and not because of some honeymoon period: after their models are introduced, the model quality and the limits start to decline.
Many people are complaining about this on HN and Reddit.
I do not have any proof, but there is a pattern; I suppose Anthropic first attracts customers, then starts to optimize costs/margins.
I find Codex CLI to be very good too, but it’s missing tons of features that I use in Claude Code daily that keep me from switching full time.
- Good bash command permission system
- Rollbacks coupled with conversation and code
- Easy switching between approval modes (Claude had a keybind that makes this easy)
- Ability to send messages while it’s working (Codex just queues them up for after it’s done, Claude injects them into the current task)
- Codex is very frustrating when I have to keep allowing it to run the same commands over and over; in Claude this works well once I approve it to run a command for the session
- Agents (these are very useful for controlling context)
- A real plan mode (crucial)
- Skills (these are basically just lazy loaded context and are amazing)
- The sandboxing in codex is so confusing, commands fail all the time because they try to log to some system directory or use internet access which is blocked by default and hard to figure out
- Codex prefers python snippets to bash commands which is very hard to permission and audit
When Codex gets to feature parity, I’ll seriously look at switching, but until then it’s just a really good model wrapped in an okay harness
I don't think anyone can reasonably argue against Claude Code being the most full-featured and pleasant to use of the CLI coding agent tools. Maybe some people like the Codex user experience for idiosyncratic reasons, but it (like Gemini CLI) still feels to me rather thrown together - a Claude Clone with a lot of rough edges.
But these CLI tools are still fairly thin wrappers around an LLM. Remember: they're "just an LLM in a while loop with access to tool calls." (I exaggerate, and I love Claude Code's more advanced features like "skills" as much as anyone, but at the core, that's what they are.) The real issue at stake is which LLM behind the agent is better: is GPT-5 or Sonnet 4.5 better at coding? On that I think opinion is split.
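To make that "LLM in a while loop" framing concrete, here is a minimal sketch of the kind of loop these CLIs run. It's illustrative only: it uses the openai Python SDK's tool calling with a placeholder "gpt-5" model name and a single run_shell tool, while the real harnesses layer permissions, sandboxing, and diff rendering on top.

```python
# Minimal sketch of "an LLM in a while loop with access to tool calls".
# Assumptions: openai Python SDK installed, OPENAI_API_KEY set, "gpt-5" as a placeholder model name.
import json
import subprocess
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "run_shell",
        "description": "Run a shell command in the repo and return its output.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}]

messages = [{"role": "user", "content": "Fix the failing test in this repo."}]

while True:
    resp = client.chat.completions.create(model="gpt-5", messages=messages, tools=tools)
    msg = resp.choices[0].message
    messages.append(msg)
    if not msg.tool_calls:            # no tool calls left: the agent considers itself done
        print(msg.content)
        break
    for call in msg.tool_calls:       # run each requested command and feed the output back
        cmd = json.loads(call.function.arguments)["command"]
        out = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": (out.stdout + out.stderr)[-4000:],  # truncate to keep context manageable
        })
```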
Incidentally, you can run Claude Code with GPT-5 if you want a fair(er) comparison. You need a proxy like LiteLLM, and you will have to use the OpenAI API and pay per-token, but it's not hard to do and quite interesting. I haven't used it enough to make a good comparison, however.
Yeah I think the argument is the tooling vs the agent. Maybe the OpenAI agent is performing better now, but the tooling is significantly better from Anthropic.
The anthropic (ClaudeCode) tooling is best-in-class to me. You listed many features that I have become so reliant on now, that I consider them the Ante that other competitors need to even be considered.
I have been very impressed with the Anthropic agent for code generation and review. I have found the OpenAI agent to be significantly lacking by comparison. But to be fair, the last time I used OpenAI's agent for code was about a month ago, so maybe it has improved recently (not at all unreasonable in this space). But at least a month ago, when using them side-by-side, the codex CLI was VERY basic compared to the wealth of features and UI in the ClaudeCode CLI. The agents for Claude were also so much better than OpenAI's that it wasn't even close. OpenAI has always delivered me improper code (non-working or invalid) at a very high rate, whereas Claude generally delivers valid code; the debate is just whether it is the desired way to build something.
I am not sure copying your competitors feature-by-feature is always a good strategy. It can make the onboarding of your competitor's users easier, but lead to a worse product overall.
This is especially the case in a fast moving field such as this. You would not want to get stuck in the same local minimum as your competitor.
I would rather we have competing products that try different things to arrive at a better solution overall.
I'm using opencode which I think is now very close to covering all the functionality of claude code. You can use GPT5 Codex with it along with most other models.
To fix having to approve commands over and over, use Windows WSL. Codex does not play nice with permissions/approvals on Windows; WSL solves that completely.
I totally agree. I remember the June magic as well - almost overnight my abilities and throughput were profoundly increased, I had many weeks of late nights in awe and wonder trying things that were beyond my ability to implement technically but within the bounds of my conceptual understanding.
Initially, I found Codex CLI with GPT-5 to be a substitute for Claude Code - now GPT-5 Codex materially surpasses it in my line of work, with a huge asterisk. I work in a niche industry, and Codex has generally poor domain understanding of many of the critical attributes and concepts. Claude happens to have better background knowledge for my tasks, so I've found that Sonnet 4.5 with Claude Code generally does a better job at scaffolding any given new feature. Then, I call in Codex to implement actual functionality since Codex does not have the "You're absolutely right" and mocked/placeholder implementation issues of CC, and just generally writes clean, maintainable, well-planned code. It's the first time I've ever really felt the whole "it's as good as a senior engineer" hype - I think, in most cases, GPT5-Codex finally is as good as a senior engineer for my specific use case.
I think Codex is a generally better product with better pricing, typically 40-50% cheaper for about the same level of daily usage for me compared to CC. I agree that it will take a genuinely novel and material advancement to dethrone Codex now. I think the next frontier for coding agents is speed. I would use CC over Codex if it was 2x or 3x as fast, even at the same quality level. Otherwise, Codex will remain my workhorse.
>I think, in most cases, GPT5-Codex finally is as good as a senior engineer for my specific use case.
This is beyond bananas to me given that I regularly see Codex high and GPT-5-high both fail to create basic React code slightly off the normal distribution.
I'm not saying this is a paid endorsement, but the internet is dead and I wonder what OpenAI would pay, if they could, to get such a glowing review as the top comment on HN.
For what it's worth, I'm not affiliated with OpenAI (you can verify by my comment history [1] and account age) and I agree with the top comment. I do Elixir consulting primarily and nothing beats OpenAI's model at the moment for Elixir. Previously, their o3 models were quite decent. But GPT-5 is really damn good.
Claude code will unnecessarily try to complicate a problem solution.
I'm not saying this was a paid comment, but if we're going to speculate, we could just as easily ask what Anthropic would pay, if they could, to drown out a strongly pro-OpenAI take sitting at the top of their own promotional HN thread.
That said, you're right that the broader internet (Reddit especially) is heavily astroturfed. It's not unusual to see "What's the best X?" threads seeded by marketers, followed by a horde of suspiciously aligned comments.
But without actual evidence, these kinds of meta comments (yours and mine) are just cynical noise.
I've heard this opinion a lot recently. Codex is getting better and Claude is getting worse, so it had to happen sooner or later. Well, it's competition, so I'm waiting for Claude to catch up. The web Claude Code is good, but they really need to fix their quota; it's unusable. I would choose a worse model (maybe at 90%) that has a better, usable quota. Not to mention GPT-5 and GPT-5-Codex seem to have caught up or even be better now.
Are you really going to call someone a shill? I’d argue that you’re why the internet is dying - a million options and you had to choose the most offensive?
The only way to tell human from AI now is disagreeableness, it’s the one thing the GPTs refuse to do. I can’t stand their cloying sycophancy but at least it means that serial complainers will gain some trust, at least for as long as Americans are leading the hunt and deciding to baby us.
I completely agree with this. The amount of unprompted “I used to love Claude Code but now…” content that follows the exact same pattern feels really off. All of these people post without any prompts for comparison, and OP even refused to share specifics so we have to take his claim as ‘trust me bro’
I agree with this and actually Claude Code agrees with it too. I've had Codex CLI (gpt-5-codex high) and Claude Code 4.5 Sonnet (and sometimes Opus 4.1) do the same lengthier task with the same prompt in cloned folders about 10x now, and then I ask them to review the work in the other folder and determine who did the best job.
100% of the time Codex has done a far better job according to both Codex and Claude Code when reviewing. Meeting all the requirements where Claude would leave things out, do them lazily or badly and lose track overall.
Codex high just feels much smarter and more capable than Claude currently and even though it's quite a bit slower, it's work that I don't have to go over again and again to get it to the standards I want.
I share your observations. It's strange to see Anthropic losing so much ground so fast - they seemed to be the first to crack long-horizon agentic tasks via what I can only assume is an extremely exotic RL process.
Now, I will concede that for non-coding long-horizon tasks, GPT-5 is marginally worse than Sonnet 4.5 in my own scaffolds. But GPT-5 is cheaper, and Sonnet 4.5 is about 2 months newer. However, for coding in a CLI context, GPT-5-Codex is night-and-day better. I don't know how they did it.
Same, I find Codex not good to be honest. I have better success manually copy/pasting into GPT5 chat. There's something about Codex that just wants to change everything and use the weirdest tool commands.
It also often fails to escalate a command, it'll even be like, oh well I'm in a sandbox so I guess I can't do this, and will just not do it and try to find a workaround instead of escalating permission to do the command.
The last time I used them both side by side was a month ago, so unless it's significantly improved in the past month, I am genuinely surprised that someone is making the argument that Codex is competitive with ClaudeCode, let alone it somehow being superior.
ClaudeCode is used by me almost daily, and it continues to blow me away. I don't use Codex often because every time I have used it, the output is next to worthless and generally invalid. Even if it does get me what I eventually want, it will take much more prompting for me to get the functioning result. ClaudeCode on the other hand gets me good code from the initial prompt. I'm continually surprised at exactly how little prompting it requires. I have given it challenges with very vague prompts where it really exceeds my expectations.
I use both but I agree that they are generally not on par. I find Claude Code does a better job and doesn't overengineer as much. Where Codex sometimes does better is in debugging a tough bug that stumps Claude Code. Codex is also more likely to get lazy, claiming to have finished a large task when in reality it just wrote some placeholder lines. Claude has never done that. They might be on par soon, however, and I think Anthropic is playing a dangerous game with their limit enforcement on people who are on subscriptions.
Me too, but I know it's not just people shilling, or on the take, because a bunch of people I know personally have moved from Claude Code to Codex, and say it's better.
For me, though, it's not remotely close. Codex has fucked up 95% of the 50-or-so tasks I asked it to do, while Claude Code fucks up only maybe 60%.
I'm big on asking LLMs to do the first major step of something, and then coming back later, and if it looks like it kinda sucks, just Ctrl-C and git revert that container/folder. And I also explicitly set up "here are the commands you need to run to self-check your work" every time. (Which Codex somewhat weirdly sometimes ignores with the explicit (false) claim that it skipped that step because it wasn't requested... hmm.)
So, those kinds of workflow preferences might be a factor, but I haven't seen Codex ever be good yet, and I regret the time I invested trying it too early.
Codex works much better for long-running tasks that require a lot of planning and deep understanding.
Claude, especially 4.5 Sonnet, is a lot nicer to interact with, so it may be a better choice in cases where you are co-working with the agent. Its output is nicer, and it "improvises" really well even if you give it only vague prompts. That's valuable for interactive use.
But for delegating complete tasks, Codex is far better. The benchmarks indicate that, as do most practitioners I talk to (and it is indeed my own experience).
In my own work, I use Codex for complete end-to-end tasks, and Claude Sonnet for interactive sessions. They're actually quite different.
I'm on the same page here. I have seen this sentiment about Codex suddenly being good a few times now, so I booted Codex CLI thinking-high back up after a break and asked it to look for bugs. It promptly found five bugs that didn't actually exist. It was the kind of truly impressively stupid mistake that I haven't seen Claude Code make essentially ever, and made me wonder if this isn't the sort of thing that's making people downplay the power of LLMs for agentic coding.
Same here. I tried codex a few days ago for a very simple task (remove any references of X within this long text string) and it fumbled it pretty hard. Very strange.
I can not. We're all racing very hard to take full advantage of these new capabilities before they go mainstream. And to be honest, sharing problem domains that are particularly attractive would be sharing too much. Go forth and experiment. Have fun with it. You'll figure it out pretty fast. You can read my other post here about the kinds of problem spaces I'm looking at.
Yeah this has been my experience as well. The Claude Code UI is still so much better, and the permissioning policy system is much better. Though I'm working on closing that gap by writing a custom policy https://github.com/openai/codex/blob/main/codex-rs/execpolic...
Kinda sick of Codex asking for approval to run tests for each test instance
Still a toss-up for me which one I use. For deep work Codex (codex-high) is the clear winner, but when you need to knock out something small Claude Code (sonnet) is a workhorse.
Also CC tool usage is so much better! Many, many times I’ve seen Codex writing a python script to edit a file which seems to bypass the diff view so you don’t really know what’s going on.
Yeah, after correcting it several times I've gotten Claude Code to tell me it didn't have the expertise to work in one of my problem domains. It was kinda surprising but also kinda refreshing that it knew when to give up. For better or worse I haven't noticed similar things with Codex.
I've chosen problems with non-negotiable outcomes. In other words, problem domains where you either are able to clearly accomplish the very hard thing, or not, and there's no grey area. I've purposely chosen these kinds of problems to prove what AI agents are capable of, so that there is no debate in my mind. And with Codex I've accomplished the previously impossible. Unambiguously. Codex did this. Claude gave up.
It's as if there are two vendors saying they can give you incredible superpowers for an affordable price, and only one of them actually delivers the full package. The other vendor's powers only work on Tuesdays, and when you're lucky. With that situation, in an environment as competitive as things currently stand, and given the trajectory we're on, Claude is an absolute non-starter for me. Without question.
I did the opposite: I switched to Claude Code once they released the new model last week or the week before. I tried using Codex, but there were issues with the terminal and prompting (multiple characters getting deleted). I found Claude Code to have more features and fewer bugs, like editing the prompt in vim, which is really useful, and I find it better to iterate with. I also prefer its tool usage and use of the shell; sometimes Codex prefers to use Python instead of running the equivalent shell command. Maybe it's like other people say here, that Codex is better for long-running tasks. I prefer to give Claude small tasks, I'm usually satisfied with the result, and I like to work alongside the agent.
This is such an interesting perspective because I feel codex is hugely impressive but falls apart on any even remotely difficult task and is too autonomous and not eager enough.
Claude feels like a better fit for an experienced engineer. He's a positive, eager little fellow.
API pricing rates probably. If I take a look at my current usage since it came out, it'd be about $12000 CAD if paid at API rates. Ridiculously easy to rack up absurd bills via the API, and I'm mostly just using it for code review. Someone using it heavily could easily, easily get way over 70k.
My experience is that if I know what I want, CC will produce better code, given I specify it correctly. The planning mode is great for this too, as we can "brainstorm", and what I have seen help a lot is asking it questions about why it did something a certain way. Often it'll figure out on its own why that's wrong, but sometimes it requires a bit of course correction.
On the other hand, the last time I tried GPT-5 from Cursor, it was so disappointing. It kept getting confused while we were iterating on a plan, and I had to point out multiple times that it kept thinking about the problem the same way. After a while I gave up, opened a new chat, and gave it my own summary of the conversation (with the wrong parts removed), and then it worked fine. Maybe my initial prompt was vague, but it continually seemed to forget course corrections in that chat.
I mostly tend to use them to save me from typing, rather than asking them to design things. Occasionally we do a more open-ended discussion, but those have great variance. It seems to do better with such discussions online than within the coding tool (I've bounced maths/implementation ideas off of it while writing shaders on a personal project).
When you say Claude Code, what model do you refer to? CC with Opus still outperforms Codex (gpt-5-codex) for me for anything I do (Rust, computer graphics-related).
However, Anthropic severely restricted Opus use for Max plan users 10 days or so ago (12-fold, from 40h/week down to 5h/week) [1].
Sonnet is a vastly inferior model for my use cases (but still frequently writes better Rust code than Codex). So now I use Codex for planning and Sonnet for writing the code. However, I usually need about 3--5 loops with Codex reviewing, Sonnet fixing, rinse & repeat.
Before I could use one-shot Opus and review myself directly, and do one polish run following my review (also via Opus). That was possible from June--mid October but no more.
Agreed that Opus is stronger than Sonnet 4.5 and GPT-5 High. It's the bitter pill - bigger, more expensive models are just "smarter", even if it doesn't always show in synthetic benchmarks. Similar with o1-pro (now almost a year old, an eternity in this space) vs GPT-5 high. There's also GPT-5 Pro now, which comes at an API cost of $120/M output, and is also noticeably smarter, just like Opus.
They all like to push synthetic benchmarks for marketing, but to me there's zero doubt that both Anthropic and OpenAI are well aware that they're not representative of logical thinking and creativity.
On the topic of comparing OpenAI models with Anthropic models, I have a hybrid approach that seems really nice.
I set up an MCP tool that lets Claude Code call gpt-5 with high reasoning (tools with "personas" like architect, security reviewer, etc.), and I feel that it SIGNIFICANTLY amplifies the performance of Claude alone. I don't see other people using LLMs as tools in these environments, and it's making me wonder if I'm either missing something or somehow ahead of the curve.
Basically instead of "do x (with details)" I say "ask the architect tool for how you should implement X" and it gets into this back and forth that's more productive because it's forcing some "introspection" on the plan.
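For anyone wanting to try the same pattern, here's a minimal sketch of one way to wire up that kind of "architect" tool as an MCP server. It assumes the official MCP Python SDK (the FastMCP helper) and the OpenAI Responses API; the model name, persona prompt, and registration command are illustrative placeholders, not a description of the exact setup above.

```python
# architect_server.py - a hypothetical MCP server exposing a GPT-5 "architect" persona.
from mcp.server.fastmcp import FastMCP
from openai import OpenAI

mcp = FastMCP("architect")
client = OpenAI()

PERSONA = (
    "You are a senior software architect. Given a question and codebase context, "
    "return a concrete implementation plan: files to touch, interfaces, edge cases, "
    "and risks. Push back on bad ideas."
)

@mcp.tool()
def ask_architect(question: str, context: str = "") -> str:
    """Ask a high-reasoning GPT-5 'architect' persona how to implement something."""
    resp = client.responses.create(
        model="gpt-5",                    # placeholder model name
        reasoning={"effort": "high"},
        input=f"{PERSONA}\n\nContext:\n{context}\n\nQuestion:\n{question}",
    )
    return resp.output_text

if __name__ == "__main__":
    # Serve over stdio so a coding agent can register it, e.g. (assumed syntax):
    #   claude mcp add architect -- python architect_server.py
    mcp.run()
```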
Sourcegraph Amp (https://sourcegraph.com/amp) has had this exact feature built in for quite a while: "ask the oracle" triggered an O1 Pro sub-agent (now, I believe, GPT-5 High), and searching can be delegated to cheaper, faster, longer-context sub-agents based on Gemini 2.5 Flash.
> We were heavy users of Claude Code ($70K+ spend per year)
Claude Code has only been generally available since May last year (a year and a half ago)... I'm surprised by the process that you are implying; within a year and a half, you both spent $70k on Claude Code and knew enough about it and its competition to switch away from it? I don't think I'd be able to do due diligence even if LLM evaluation was my full-time job. Let alone the fact that the capabilities of each provider are changing dramatically every few weeks.
Claude Code is still good but I don’t TRUST it. With Claude Code and Sonnet I’m expecting failure. I can get things done but there’s an administrative overhead of futzing around with markdown files, defensive commit hooks and unit tests to keep it on rails while managing the context panic. Codex CLI with gpt-5-codex high reasoning is next gen. I’m sure Sonnet 5 will match it soon. At that point I think a lot of the workflows people use in Claude Code will be obsolete and the sycophancy will disappear.
In agreement. Large caveats that can explain differing opinions (that I've experienced) are:
* Is really only magic on Linux or WSL. Mediocre on Windows
* Is quite mediocre at UI code but exceptional at backend, engineering, ops, etc. (I use Claude to spruce up everything user facing -- Codex _can_ mirror designs already in place fairly well).
* Exceptional at certain languages, OK at others.
* GPT-5 and GPT-5-Codex are not the same. Both are models used by the Codex CLI and the GPT-5-Codex model is recent and fantastically good.
* Codex CLI is not "conversational" in the way that Claude is. You kind of interact with it differently.
I often wonder about the impact of different prompting styles. I think the WOW moment for me is that I am no longer returning to code to find tangled messes, duplicate silo'd versions of the same solution (in a different project in the same codebase), or strangely novice style coding and error handling.
As a developer for 20yrs+, using Codex running the GPT-5-Codex model has felt like working with a peer or near-peer for the first time ever. I've been able to move beyond smaller efforts and also make quite a lot of progress that didn't have to be undone/redone. I've used it for a solid month making phenomenal progress and able to offload as-if I had another developer.
Honestly, my biggest concern is that OpenAI is teasing this capable model and then pulls the rug in a month with an "update".
As for the topic at hand, I think Claude Code has without a doubt the best "harness" and interface. It's faster, painless, and has a very clean and readable way of laying out findings when troubleshooting. If there were a cheap and usable version of Opus... perhaps that would keep Claude Code on the cutting edge.
> I've seen Claude Code literally just give up on a hard problem and tell me to buy something off the shelf
I've been seeing more of this lately despite initial excellent results. Not sure what's going on, but the value is certainly dropping for me. I'll have to check out codex. CLI integration is critical for me at this point. For me it is the only thing that actually helps realize the benefits of LLM models we have today. My last NixOS install was completely managed by Claude Code and it worked very well. This was the result of my latest frustrations:
Though I know the statement it made isn't "true". I've had much better luck pursuing other implementation paths with CC in the same space. I could have prompted around this and should have reset the context much earlier but I was drunk "coding" at that point and drove it into a corner.
I haven’t used Codex a lot, but GPT-5 is just a bit smarter in agent mode than Claude 4.5. The most challenging thing I’ve used it for is for code review and GPT-5 somewhat regularly found intricate bugs that Claude missed. However, Claude seemed to be better at following directions exactly vs GPT-5 which requires a lot more precision.
Have tried all and continue to eval regularly. I spend up to 14 hours a day. Currently recovering from a herniated disk because I spent 6 weeks sitting at a dining room table, 14 hours a day, leaning forward. Don't do that. lol. So my coverage is pretty good. I'm using GPT5-codex-high for 99% of my work. Also I have a team of 40 folks, about a third of which are software engineers and the other third are cybersecurity analysts, so I get feedback from them too and we go deep on our engineering calls re the latest learnings and capabilities.
This was my experience too until a couple weeks ago, when Codex suddenly got dumbed down.
Initially, I had great success with codex medium - I could refactor with confidence, code generally ran on the first or second try, etc.
Then when that suddenly dumbed down to Claude Sonnet 3.5 quality I moved to GPT5 High to get back what had been lost. That was okay for a few days. Now GPT5 High has dropped to Claude Sonnet 3.5 quality.
IMO Sonnet 4.5 is great but it just isn't as comprehensive of a thinker. I love Anthropic and primarily use CC day to day, but for any tricky problems or "high stakes, this must not have bugs" issues, I turn to Codex. I do find that if you let Codex run on its own too long it will produce comparably sloppy or lacking-in-vision type issues that people criticize Sonnet for, however.
I'm like 80% sure Sonnet 4.5 is just rebranded Opus.
Sonnet 4 was a coding companion, I could see what it was doing and it did what I asked.
Sonnet 4.5 is like Opus: it generates massive amounts of "helper scripts" and "bootstrap scripts" and all kinds of useless markdown documentation files even for the tiniest PoC scripts.
Costs are 6x lower and it's way faster and good at test writing and tool calling. It can sometimes be a bit messy though, so use Gemini or Claude or Codex for the hard problems....
Similar feeling. It seems to be good at certain things, but if something doesn't work it wants to do things simply, and in turn it becomes something you didn't ask for, at times the opposite of what you wanted. On the other hand, with Codex there are certain times you feel the AGI, but that's like 2 out of 10 sessions. This may primarily be due to how complete the prompt is and how well you define the problem.
Totally agree. I was just thinking that I wouldn't want this feature for Claude Code, but for Codex right now it would be great! I can simply let tasks run in Codex and I know it's going to eventually do what I want. Whereas with Claude Code I feel like I have to watch it like a hawk and interrupt it when it goes off the rails.
My experience is similar, but for me, Claude Code is still better when designing or developing a frontend page from scratch. I have seen that Codex follows instructions a bit too literally, and the result can feel a little cold.
CC on the other hand feels more creative and has mostly given better UI.
Of course, once the page is ready, I switch to Codex to build further.
Does no one use Block's Goose CLI anymore? I went to a hackathon in SF at the beginning of the year and it seemed like 90% of the groups used Goose to do something in their agent project. I get that the CLI agent scene has exploded since then; I just wonder what is so much better in the competition?
As we all know here, if the title of this post was about Codex on the web, the top comment would have been about using Claude instead.
YMMV, but this definitely doesn't track with everything I've been seeing and hearing, which is that Codex is inferior to Claude on almost every measure.
fwiw I'm happy to see this - been trying to tackle a hairy problem (rendering bugs) and both models fail, but:
1. Codex takes longer to fail and with less helpful feedback, but tends to at least not produce as many compiler errors
2. Claude fails faster and with more interesting back-and-forth, though tends to fail a bit harder
Neither of them are fixing the problems I want them to fix, so I prefer the faster iteration and back-and-forth so I can guide it better
So it's a bit surprising to me when so many people are picking a "clear winner" that I prefer less atm
They put all of their eggs in the coding basket, with the rest of their mission couched as "effective altruism," or "safetyism," or "solving alignment," (all terms they are more loudly attempting to distance themselves from[0], because it's venture kryptonite).
Meanwhile, all OpenAI had to do was point their training cannon at it for a run, and suddenly Anthropic is irrelevant. OpenAI's focus as a consumer company (and growing as a tool company) is a safe, venture-backable bet.
Frontier AI doesn't feel like a zero-sum game, but for now, if you're betting on AI at all, you can really only bet on OpenAI, like Tesla being a proxy for the entire EV industry.
For non-vibe coding purposes I've found that my $200 Claude (Claude Code) account regularly outperformed my $200 ChatGPT (Codex) account. This was after 2 months of heavily testing both mostly in Terminal TUI/CLI form and most recently with the latest VSCode/Cursor incarnations.
Even with the additional Sora usage and other bells & whistles that ChatGPT @ $200 provides, Claude provides more value for my use cases.
Claude Code is just a lot more comfortable being in your workflow and being a companion or going full 'agent(s)' and running for 30 minutes on one ticket. It's also a lot happier playing with Agents from other APIs.
There's nothing wrong with Anthropic wanting to completely own that segment and not have aspirations of world domination like OpenAI. I don't see how that's a negative.
If anything, the more ChatGPT becomes a 'everything app' the less likely I am to hold on to my $20 account after cancelling the $200 account. I'm finding the more it knows about me the more creeped out and "I didn't ask for this" I become.
It's really solid. It's effectively a web (and native mobile) UI over Claude Code CLI, more specifically "claude --dangerously-skip-permissions".
Anthropic have recognized that Claude Code where you don't have to approve every step is massively more productive and interesting than the default, so it's worth investing a lot of resources in sandboxing.
It’s interesting because I’ve slowly arrived at the opposite conclusion: for much of my practical day to day work, using CC with “allow edits” turned OFF results in a much better end product. I can correct it inline, I pseudo-review the code as it’s produced, etc etc. Codex is better for “fire and forget” features for sure. But Claude remains excellent at grokking intent for problems where you aren’t quite sure what you want to build yet or are highly opinionated. Mostly due to the fact it’s faster and the iteration loop is faster.
That approach should work well for projects where you are directly working on the code in tandem with Claude, but a lot of my own uses are much more research oriented. I like sending Claude Code off on a mission to figure out how to do something.
It slows it down far too much for me. What I've found after switching to --dangerously-skip-permissions is that while the intermediate work product is often total junk, when I then start writing a message to tell Claude to switch approach, a large proportion of the time it has figured that out by itself before I'm finished writing the message.
So increasingly I let it run, and then review when it stops, and then I give it a proper review, and let it run until it stops again. It wastes far less of my time, and finishes new code much faster. At least for the things I've made it do.
> it's worth investing a lot of resources in sandboxing.
I tend to agree. There’s an opportunity to make it easy to have Claude be able to test out workflows/software within Debian, RPM, Windows, etc… container and VM sandboxes. This could be helpful for users that want to release code on multiple platforms and help their own training and testing, which they seem to be heavily invested in given all the “How Am I doing?” prompts we’re getting.
Do you have a practical sense of the level of mischief possible in the sandbox? It seems like a game of regexp whack-a-mole to me, which seems like a predictable recipe for decades of security problems. Allow- and deny-lists for files and domains seem about as secure as backslash-escaping user input before passing it to the shell.
If you configure it with the "no network access" environment there's nothing bad that can happen. Worst is you end up wasting a bunch of CPU cycles in a container somewhere in Anthropic's infrastructure.
great points @simonw - I, incredibly, haven't ever tried --dangerously-skip-permissions yet for any "real" projects. I generally find that it stops itself for good reason.
The most interesting parts of this to me are somewhat buried:
- Claude Code has been added to iOS
- Claude Code on the Web allows for seamless switching to Claude Code CLI
- They have open sourced an OS-native sandboxing system which limits file system and network access _without_ needing containers
However, I find the emphasis on limiting the outbound network access somewhat puzzling because the allowlists invariably include domains like gist.github.com and dozens of others which act effectively as public CMS’es and would still permit exfiltration with just a bit of extra effort.
I used `sandbox-exec` previously before moving to a better solution (done right, sandboxing on macOS can be more powerful than Linux imo). The way `sandbox-exec` works is that all child processes inherit the same restrictions. For example, if you run `sandbox-exec $rules claude --dangerously-skip-permissions`, any commands executed by Claude through a shell will also be bound by those same rules. Since the sandbox settings are applied globally, you currently can’t grant or deny granular read/write permissions to specific tools.
Using a proxy through the `HTTP_PROXY` or `HTTPS_PROXY` environment variables has its own issues. It relies on the application respecting those variables; if it doesn't, the connection will simply fail. Sure, in this case, since all other network connection requests are dropped, you are somewhat protected, but an application that doesn't respect them will simply not work.
You can also have some fun with `DYLD_INSERT_LIBRARIES`, but that often requires creating shims to make it work with codesigned binaries.
Exfiltration is always going to be possible, the question is, is it difficult enough for an attacker to succeed against the defenses I've put in place. The problem is, I really want to share, and help protect others, but if I write it up somewhere anybody can read, it's gonna end up in the training data.
I feel like these background agents still aren't doing what I want from a developer experience perspective. Running in an inaccessible environment that pushes random things to branches that I then have to checkout locally doesn't feel great.
AI coding should be tightly in the inner dev loop! PRs are a bad way to review and iterate on code. They are a last line of defense, not the primary way to develop.
Give me an isolated environment that is one click hooked up to Cursor/VSCode Remote SSH. It should be the default. I can't think of a single time that Claude or any other AI tool nailed the request on the first try (other than trivial things). I always need to touch it up or at least navigate around and validate it in my IDE.
Right, that is closer to what I was hoping this announcement would be. I really just want a (mobile/web) companion to whatever CLI environment I have Claude Code running in. That would perfectly fill in the exact niche missing in my local dev server VM setup I remote into with any combination of SSH, VS Code Remote, or via Web (VS Code Tunnel from vscode.dev and a ttyd remote CLI session in the browser).
It would be great to be able to check in on Claude on a walk or something to make sure it hasn't gone off the rails or send it a quick "LGTM" to keep moving down a large PLAN.md file without being tethered to a keyboard and monitor. I can SSH from my phone but the CLI ergonomics are ... not great with an on screen keyboard, when all it really needs is a simple threaded chat UI.
I've seen a couple of GitHub projects and "Happy Coder" on a Show HN, which I haven't got around to setting up yet, that seem in the ballpark of what I want, but a first party integration would always be cool.
I tried Happy Coder for a bit. It seemed exactly what I was missing but about 1/2 the time session notifications weren't coming through and the developers of the tool seem busy pushing it off in other directions rather than in making the core functionality bullet-proof so I gave up on it. Unfortunate. Hopefully something else pops up or Anthropic bakes it into their own tooling.
I agree and I also think the problem is deeper than that. It's about not being able to do most code testing and debugging remotely. You can't really test anything remotely... It's in an ephemeral container without any of your data, just your repo. You can't have the model do npm run dev and browse to see the webpage, click around, etc. You can't compile or run anything heavy, you can't persist data across sessions/days, etc.
I like the idea of background agents running in the cloud, but it has to be a more persistent environment. It also has to run with a GUI so it can develop web applications or run the programs we are developing, and run them properly with the GUI, which requires clicking around, typing things, etc. Computer use is what we need. But that would probably be too expensive to serve to the masses with the current models.
Definitely sounds cool. But the problem hasn't even been solved locally yet. Distributed microservices, 3rd party dependencies, async callbacks, reasonable test data, unsatisfiable validations, etc. Every company has their own hacked together local testing thing that mostly doesn't work.
That said, maybe this is the turning point where these companies work toward solving it in earnest, since it's a key differentiator of their larger PLATFORM and not just a cost. Heck, if they get something like that working well, I'd pay for it even without the AI!
Edit: that could end up being really slick too if it was able to learn from your teammates and offer guidance. Like when you're checking some e2e UI flows but you need a test item that has some specific detail, it maybe saw how your teammate changed the value or which item they used or created, and can copy it for you. "Hey it looks like you're trying to test this flow. Here's how Chen did it. Want me to guide you through that?" They can't really do that with just CLI, so the web interface could really be a game changer if they take full advantage of it.
What you're describing feels like the next major evolution and is likely years away (and exciting!).
I'm mainly aiming for a good experience with what we have today. Welding an AI agent onto my IDE turned out to be great. The next incremental step feels like being able to parallelize that. I want four concurrent IDEs with AI welded onto it.
idk, we’ve (humans) gotten this far with them. I don’t think they are the right tool for AI generated code and coding agents though, and that these circles are being forced to fit into those squares. imho it’s time for an AI-native git or something.
PRs work well for what they are. Ship off some changes you're strongly confident about and have another human who has a lot of context read through it and double check you. It's for when you think you've finished your inner loop.
AI is more akin to pair programming with another person sitting next to you. I don't want to ship a PR or even a branch off to someone sitting next to me. I want to discuss and type together in real time.
Agree, each agent creating a PR and then coordinating merges is a pain.
I’d like
- agent to consolidate simple non-conflicting PRs
- faster previews and CI tests (Render currently)
- detect and suggest solutions for merge conflicts
Codex web doesn’t update the PR, which is also something to change, maybe a setting, but for web code agents (?) I’d like the PR to stay open once opened
Also PRs need an overhaul in general. I create lots of speculative agents, if I like the solution I merge, leading to lots of PRs
Thank you. Every time these agentic cloud tools come out, I wonder to myself whether I'm not using them right or misunderstand vs, say, local Cursor development paradigm.
Plus they generate so much noise with all the extra commits and comments that go to everyone in slack and email rather than just me.
I just run the agent directly on separate testing/dev servers via remote-ssh in VS Code to have an IDE to sanity check stuff. Just far simpler than local dev and other nonsense.
this is a great point. The inner / outer loop is big. I think AI pushing PRs is kind of like pushing drafts to the public in social media. I don't want folks seeing PRs and such until I feel good about it. It adds a lot of noise, and increases build costs unless your CI/CD treats them differently which I don't know anyone doing.
Yes, but my point is that often I don't want to. Sometimes there are changes I can make in seconds. I don't want to wait 15+ seconds for an AI that might do it wrong or do too much.
Also it isn't always about editing. It is about seeing the surrounding code, navigating around, and ensuring the AI did the right thing in all of the right places.
Huge waste of time. You are being sold a bill of goods whose only purpose is to make you a dumb dev. Like woah, an LLM can use CDP!! Who cares. Can't wait till people start waking up to this grift. These things are making people so dumb and a few richer, that's it.
Hey, Kanjun from Imbue here! This is exactly why we built Sculptor (https://imbue.com/sculptor), a desktop UI for Claude Code.
Each agent has its own isolated container. With Pairing Mode, you can sync the agent's code and git state directly into your local Cursor/any IDE so you can instantly validate its work. The sync is bidirectional so your local changes flow back to the agent in realtime.
Happy to answer any questions - I think you'll really like the tight feedback loop :)
> We were heavy users of Claude Code ($70K+ spend per year) and have almost completely switched to codex CLI
Seeing comments like this all over the place. I switched to CC from Cursor in June / July because I saw the same types of comments. I switched from VSCode + Copilot about 8 months before that for the same reason. I remember being skeptical that this sort of thing was guerilla marketing, but CC was in fact better than Cursor. Guess I'll try Codex, and I guess that it's good that there are multiple competing products making big strides.
Never would have imagined myself ditching IDEs and workflows 3x in a few months. A little exhausting
I think it’s a lot less exhausting now that the IDE part is mostly decoupled. I can’t imagine cursor continuing to compete when really all they’re doing is selling tokens either a markup, and hence crushing your context on every call. Sorry if that sounds negative but it’s true.
I use CC and Codex somewhat interchangeably, but I have to agree with the comments. Codex is a complete monster, and there really isn't any competition right now.
OpenAI seems to limit how "hard" your gpt-5-codex can think depending on your subscription plan; whereas Anthropic/Claude only limits how much use you get. I evaluate Codex every month or so with a problem suited to it, but rarely gets merged over a version produced by Charlie (which yes is $500/mo, but rarely causes problems) or something Claude did in a managed or unmanaged session. ymmv
No relations to them, but I've started using Happy[0]'s iOS app to start and continue Claude Code sessions on my iPhone. It allows me to run sessions on a custom environment, like a machine with a GPU to train models
This seems to be the only solution still if using bedrock or direct API access instead of Pro / Max plan, the Claude Code for Web doesn't seem to let you use it that way.
You can log in to your CC instance however you like, including via Pro/Max. Happy just wraps it and provides remote access with a much better UI than using a phone-based terminal app.
I was just working on something similar for OpenCode - pushing it now in case it's useful for someone[0].
It can run in a front-end only mode (I'll put up a hosted version soon), and then you need to specify your OpenCode API server and it'll connect to it. Alternatively, it can spin up the API server itself and proxy it, and then you just need to expose (securely) the server to the internet.
The UI is responsive and my main idea was that I can easily continue directing the AI from my phone, but it's also of course possible to just spin up new sessions. So often I have an idea while I'm away from my keyboard, and being able to just say "create an X" and let it do its thing while I'm on the go is quite exciting.
It doesn't spin up a special sandbox environment or anything like that, but you're really free to run it inside whatever sandboxing solution you want. And unlike Claude Code, you're of course free to choose whatever model you want.
I've been using Happy Coder[0] for some time now on web and mobile. I run it in `--yolo` mode on an isolated VM across multiple projects.
With Happy, I managed to turn one of these Claude Code instances into a replacement for Claude that has all the MCP goodness I could ever want and more.
If we ship more for less because the new agent doesn't tap out, that's not a honeymoon, it's an upgrade.
https://github.com/just-every/code
Fixed all of these in a heartbeat. This has been a game changer
This is a really neat way of describing the phenomenon I've been experiencing and trying to articulate, cheers!
[1] https://news.ycombinator.com/item?id=45491842
It's very odd because I was hoping they were very on par.
It also often fails to escalate a command, it'll even be like, oh well I'm in a sandbox so I guess I can't do this, and will just not do it and try to find a workaround instead of escalating permission to do the command.
I use Claude Code almost daily, and it continues to blow me away. I don't use Codex often, because every time I have used it, the output has been next to worthless and generally invalid. Even when it does eventually get me what I want, it takes much more prompting to reach a functioning result. Claude Code, on the other hand, gives me good code from the initial prompt. I'm continually surprised at exactly how little prompting it requires. I have given it challenges with very vague prompts where it really exceeded my expectations.
For me, though, it's not remotely close. Codex has fucked up 95% of the 50-or-so tasks I asked it to do, while Claude Code fucks up only maybe 60%.
I'm big on asking LLMs to do the first major step of something, and then coming back later, and if it looks like it kinda sucks, just Ctrl-C and git revert that container/folder. And I also explicitly set up "here are the commands you need to run to self-check your work" every time. (Which Codex somewhat weirdly sometimes ignores with the explicit (false) claim that it skipped that step because it wasn't requested... hmm.)
So, those kinds of workflow preferences might be a factor, but I haven't seen Codex ever be good yet, and I regret the time I invested trying it too early.
Claude, especially 4.5 Sonnet, is a lot nicer to interact with, so it may be a better choice in cases where you are co-working with the agent. Its output is nicer, and it "improvises" really well even if you give it only vague prompts. That's valuable for interactive use.
But for delegating complete tasks, Codex is far better. The benchmarks indicate that, as do most practitioners I talk to (and it is indeed my own experience).
In my own work, I use Codex for complete end-to-end tasks, and Claude Sonnet for interactive sessions. They're actually quite different.
Sonnet 4.5 tends to randomly hallucinate odd/inappropriate decisions and goes off making stupid changes that have to be patched up manually.
Kinda sick of Codex asking for approval to run tests for each test instance
https://zed.dev/blog/codex-is-live-in-zed
Also, CC's tool usage is so much better! Many, many times I've seen Codex write a Python script to edit a file, which seems to bypass the diff view, so you don't really know what's going on.
It's as if there are two vendors saying they can give you incredible superpowers for an affordable price, and only one of them actually delivers the full package. The other vendor's powers only work on Tuesdays, and when you're lucky. In that situation, in an environment as competitive as things currently stand, and given the trajectory we're on, Claude is an absolute non-starter for me. Without question.
Claude feels like a better fit for an experienced engineer. He's a positive, eager little fellow.
On the other hand, the last time I tried GPT-5 from Cursor, it was so disappointing. It kept getting confused while we were iterating on a plan, and I had to point out multiple times that it was thinking about the problem in the same (wrong) way. After a while I gave up, opened a new chat, and gave it my own summary of the conversation (with the wrong parts removed), and then it worked fine. Maybe my initial prompt was vague, but it continually seemed to forget course corrections in that chat.
I mostly use them to save me from typing rather than asking them to design things. Occasionally we have a more open-ended discussion, but those have great variance. They seem to do better with such discussions in the web chat than within the coding tool (I've bounced maths/implementation ideas off them while writing shaders on a personal project).
When making boilerplate-ish changes in areas of the product I'm not familiar with (it's a large codebase), GPT-5-high is a monster.
However, Anthropic severely restricted Opus use for Max plan users 10 days or so ago (12-fold, from 40h/week down to 5h/week) [1].
Sonnet is a vastly inferior model for my use cases (but still frequently writes better Rust code than Codex). So now I use Codex for planning and Sonnet for writing the code. However, I usually need about 3-5 loops of Codex reviewing, Sonnet fixing, rinse & repeat.
Before, I could one-shot with Opus, review it myself directly, and do one polish run following my review (also via Opus). That was possible from June through mid-October, but no more.
[1] https://github.com/anthropics/claude-code/issues/8449
They all like to push synthetic benchmarks for marketing, but to me there's zero doubt that both Anthropic and OpenAI are well aware that they're not representative of logical thinking and creativity.
I set up an MCP tool to use GPT-5 with high reasoning from Claude Code (tools with "personas" such as architect, security reviewer, etc.), and I feel that it SIGNIFICANTLY amplifies the performance of Claude alone. I don't see other people using LLMs as tools in these environments, and it's making me wonder if I'm either missing something or somehow ahead of the curve.
Basically, instead of "do X (with details)", I say "ask the architect tool how you should implement X", and it gets into a back-and-forth that's more productive because it forces some "introspection" on the plan.
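For anyone who wants to try the same pattern, here's a rough sketch of what such a "persona" tool can look like, assuming the MCP Python SDK (FastMCP) and OpenAI's Responses API; the tool name, system prompt, and model settings are illustrative, not necessarily what the commenter above actually runs:

    # architect_server.py - minimal sketch of an "architect" persona as an MCP tool.
    # Assumes the MCP Python SDK and the OpenAI client are installed and
    # OPENAI_API_KEY is set in the environment.
    from mcp.server.fastmcp import FastMCP
    from openai import OpenAI

    mcp = FastMCP("architect")
    client = OpenAI()

    @mcp.tool()
    def architect(question: str) -> str:
        """Ask a high-reasoning model for an implementation plan or design review."""
        response = client.responses.create(
            model="gpt-5",                 # assumed model name; swap for whatever you use
            reasoning={"effort": "high"},  # the "high reasoning" part
            input=(
                "You are a pragmatic software architect. "
                "Propose a concrete, step-by-step implementation plan.\n\n" + question
            ),
        )
        return response.output_text

    if __name__ == "__main__":
        mcp.run()  # stdio transport by default, which Claude Code can attach to

Once the server is registered in Claude Code's MCP config, prompts like "ask the architect tool how you should implement X" route through the bigger model before Claude writes any code.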
Sourcegraph Amp (https://sourcegraph.com/amp) has had this exact feature built in for quite a while: "ask the oracle" triggered an O1 Pro sub-agent (now, I believe, GPT-5 High), and searching can be delegated to cheaper, faster, longer-context sub-agents based on Gemini 2.5 Flash.
Claude Code has only been generally available since May last year (a year and a half ago)... I'm surprised by the process you're implying: within a year and a half, you both spent $70K on Claude Code and learned enough about it and its competition to switch away from it? I don't think I'd be able to do that due diligence even if LLM evaluation were my full-time job. Let alone the fact that the capabilities of each provider are changing dramatically every few weeks.
Fails to escalate permissions, gets derailed, loves changing too many things everywhere.
GPT-5 is good, but Codex is not.
* Is really only magic on Linux or WSL. Mediocre on Windows
* Is quite mediocre at UI code but exceptional at backend, engineering, ops, etc. (I use Claude to spruce up everything user facing -- Codex _can_ mirror designs already in place fairly well).
* Exceptional at certain languages, OK at others.
* GPT-5 and GPT-5-Codex are not the same. Both are models used by the Codex CLI and the GPT-5-Codex model is recent and fantastically good.
* Codex CLI is not "conversational" in the way that Claude is. You kind of interact with it differently.
I often wonder about the impact of different prompting styles. I think the WOW moment for me is that I am no longer returning to code to find tangled messes, duplicate siloed versions of the same solution (in a different project in the same codebase), or strangely novice-style coding and error handling.
As a developer of 20+ years, using Codex running the GPT-5-Codex model has felt like working with a peer or near-peer for the first time ever. I've been able to move beyond smaller efforts and make quite a lot of progress that didn't have to be undone or redone. I've used it for a solid month, making phenomenal progress and offloading work as if I had another developer.
Honestly, my biggest concern is that OpenAI is teasing this capable model and then pulls the rug in a month with an "update".
As for the topic at hand, I think Claude Code has without a doubt the best "harness" and interface. It's faster, painless, and has a very clean and readable way of laying out findings when troubleshooting. If there were a cheap and usable version of Opus... perhaps that would keep Claude Code on the cutting edge.
I've been seeing more of this lately despite initially excellent results. Not sure what's going on, but the value is certainly dropping for me. I'll have to check out Codex. CLI integration is critical for me at this point; it's the only thing that actually helps realize the benefits of the LLM models we have today. My last NixOS install was completely managed by Claude Code, and it worked very well. This was the result of my latest frustrations:
https://i.imgur.com/C4nykhA.png
Though I know the statement it made isn't "true". I've had much better luck pursuing other implementation paths with CC in the same space. I could have prompted around this and should have reset the context much earlier but I was drunk "coding" at that point and drove it into a corner.
Initially, I had great success with codex medium- I could refactor with confidence, code generally ran on the first or second try, etc.
Then, when that suddenly dumbed down to Claude Sonnet 3.5 quality, I moved to GPT-5 High to get back what had been lost. That was okay for a few days. Now GPT-5 High has dropped to Claude Sonnet 3.5 quality too.
There's nothing left to fallback to.
Sonnet 4 was a coding companion, I could see what it was doing and it did what I asked.
Sonnet 4.5 is like Opus: it generates massive amounts of "helper scripts" and "bootstrap scripts" and all kinds of useless markdown documentation files, even for the tiniest PoC scripts.
Costs are 6x cheaper, and it's way faster and good at test writing and tool calling. It can sometimes be a bit messy though, so use Gemini, Claude, or Codex for the hard problems...
CC on the other hand feels more creative and has mostly given better UI.
Of course, once the page is ready, I switch to Codex to build further.
YMMV, but this definitely doesn't track with everything I've been seeing and hearing, which is that Codex is inferior to Claude on almost every measure.
1. Codex takes longer to fail and gives less helpful feedback, but at least it tends not to produce as many compiler errors.
2. Claude fails faster and with more interesting back-and-forth, though it tends to fail a bit harder.
Neither of them are fixing the problems I want them to fix, so I prefer the faster iteration and back-and-forth so I can guide it better
So it's a bit surprising to me when so many people are picking a "clear winner" that I prefer less at the moment.
I thought Claude Code was still better at tool calling and things like that.
They put all of their eggs in the coding basket, with the rest of their mission couched as "effective altruism," or "safetyism," or "solving alignment," (all terms they are more loudly attempting to distance themselves from[0], because it's venture kryptonite).
Meanwhile, all OpenAI had to do was point their training cannon at it for a run, and suddenly Anthropic is irrelevant. OpenAI's focus as a consumer company (and growing as a tool company) is a safe, venture-backable bet.
Frontier AI doesn't feel like a zero-sum game, but for now, if you're betting on AI at all, you can really only bet on OpenAI, like Tesla being a proxy for the entire EV industry.
[0] https://forum.effectivealtruism.org/posts/53Gc35vDLK2u5nBxP/...
Even with the additional Sora usage and other bells & whistles that ChatGPT @ $200 provides, Claude provides more value for my use cases.
Claude Code is just a lot more comfortable being in your workflow and being a companion or going full 'agent(s)' and running for 30 minutes on one ticket. It's also a lot happier playing with Agents from other APIs.
There's nothing wrong with Anthropic wanting to completely own that segment and not have aspirations of world domination like OpenAI. I don't see how that's a negative.
If anything, the more ChatGPT becomes a 'everything app' the less likely I am to hold on to my $20 account after cancelling the $200 account. I'm finding the more it knows about me the more creeped out and "I didn't ask for this" I become.
It's really solid. It's effectively a web (and native mobile) UI over Claude Code CLI, more specifically "claude --dangerously-skip-permissions".
Anthropic have recognized that Claude Code where you don't have to approve every step is massively more productive and interesting than the default, so it's worth investing a lot of resources in sandboxing.
Here's an example from this morning, getting CUDA working on a NVIDIA Spark: https://simonwillison.net/2025/Oct/20/deepseek-ocr-claude-co...
I have a few more in https://github.com/simonw/research
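For anyone who wants a rough local approximation of the "skip permissions, but contained" setup described above without the hosted sandbox, one option (just a sketch; the image name and mount path are placeholders, and Docker is only one of several ways to do this) is to run the CLI inside a throwaway container:

    import subprocess

    # Rough sketch: run the agent with permission prompts disabled, but only inside
    # a disposable container where the blast radius is limited to one mounted
    # project directory. "agent-image" is a placeholder image with the CLI installed.
    # Note the agent still needs outbound access to its own API, so in practice you
    # would pair this with a network allow-list or proxy rather than cutting the
    # network off entirely.
    subprocess.run([
        "docker", "run", "--rm", "-it",
        "--cap-drop", "ALL",                  # drop Linux capabilities inside the container
        "-v", "/path/to/project:/work",       # only this directory is writable from inside
        "-w", "/work",
        "agent-image",
        "claude", "--dangerously-skip-permissions",
    ], check=False)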
So increasingly I let it run, and then review when it stops, and then I give it a proper review, and let it run until it stops again. It wastes far less of my time, and finishes new code much faster. At least for the things I've made it do.
I tend to agree. There’s an opportunity to make it easy to have Claude be able to test out workflows/software within Debian, RPM, Windows, etc… container and VM sandboxes. This could be helpful for users that want to release code on multiple platforms and help their own training and testing, which they seem to be heavily invested in given all the “How Am I doing?” prompts we’re getting.
Their "restricted network access" setting looks questionable to me - it allow-lists a LOT of stuff: https://docs.claude.com/en/docs/claude-code/claude-code-on-t...
If you configure your own allow-list you can restrict to just domains that you trust - which is enforced by a separate HTTP/HTTPS proxy, described here: https://docs.claude.com/en/docs/claude-code/claude-code-on-t...
- Claude Code has been added to iOS
- Claude Code on the Web allows for seamless switching to Claude Code CLI
- They have open sourced an OS-native sandboxing system which limits file system and network access _without_ needing containers
However, I find the emphasis on limiting the outbound network access somewhat puzzling because the allowlists invariably include domains like gist.github.com and dozens of others which act effectively as public CMS’es and would still permit exfiltration with just a bit of extra effort.
Using a proxy through the `HTTP_PROXY` or `HTTPS_PROXY` environment variables has its own issues. It relies on the application respecting those variables; if it doesn't, the connection will simply fail. Sure, in this case, since all other network connection requests are dropped, you are somewhat protected, but an application that doesn't respect them just won't work.
You can also have some fun with `DYLD_INSERT_LIBRARIES`, but that often requires creating shims to make it work with codesigned binaries
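To make the env-var approach concrete, here's a minimal sketch; the proxy address and the CLI being launched are assumptions, and this only constrains tools that actually honour `HTTP_PROXY`/`HTTPS_PROXY`, so everything else still has to be dropped by a separate firewall or sandbox rule:

    import os
    import subprocess

    # Point a child process at a local allow-listing proxy purely via environment
    # variables. The proxy at 127.0.0.1:8080 is assumed to already enforce the
    # domain allow-list; this script just routes traffic to it.
    env = dict(os.environ)
    env.update({
        "HTTP_PROXY": "http://127.0.0.1:8080",
        "HTTPS_PROXY": "http://127.0.0.1:8080",
        "NO_PROXY": "localhost,127.0.0.1",  # keep local traffic direct
    })

    # "your-agent-cli" is a placeholder for whatever tool you are sandboxing.
    subprocess.run(["your-agent-cli"], env=env, check=False)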
AI coding should be tightly in the inner dev loop! PRs are a bad way to review and iterate on code. They are a last line of defense, not the primary way to develop.
Give me an isolated environment that hooks up to Cursor/VSCode Remote SSH in one click. It should be the default. I can't think of a single time that Claude or any other AI tool nailed the request on the first try (other than trivial things). I always need to touch it up, or at least navigate around and validate it in my IDE.
It would be great to be able to check in on Claude on a walk or something to make sure it hasn't gone off the rails, or send it a quick "LGTM" to keep it moving down a large PLAN.md file without being tethered to a keyboard and monitor. I can SSH from my phone, but the CLI ergonomics are ... not great with an on-screen keyboard, when all it really needs is a simple threaded chat UI.
I've seen a couple of GitHub projects and a "Happy Coder" Show HN (which I haven't got around to setting up yet) that seem in the ballpark of what I want, but a first-party integration would always be cool.
I like the idea of background agents running in the cloud, but it has to be a more persistent environment. It also has to run with a GUI so it can develop web applications or run the programs we are developing, and run them properly: clicking around the interface, typing things, etc. Computer use is what we need. But that would probably be too expensive to serve to the masses with the current models.
That said, maybe this is the turning point where these companies work toward solving it in earnest, since it's a key differentiator of their larger PLATFORM and not just a cost. Heck, if they get something like that working well, I'd pay for it even without the AI!
Edit: that could end up being really slick too if it was able to learn from your teammates and offer guidance. Like when you're checking some e2e UI flows but you need a test item that has some specific detail, it maybe saw how your teammate changed the value or which item they used or created, and can copy it for you. "Hey it looks like you're trying to test this flow. Here's how Chen did it. Want me to guide you through that?" They can't really do that with just CLI, so the web interface could really be a game changer if they take full advantage of it.
I'm mainly aiming for a good experience with what we have today. Welding an AI agent onto my IDE turned out to be great. The next incremental step feels like being able to parallelize that. I want four concurrent IDEs with AI welded onto it.
idk, we’ve (humans) gotten this far with them. I don’t think they are the right tool for AI generated code and coding agents though, and that these circles are being forced to fit into those squares. imho it’s time for an AI-native git or something.
AI is more akin to pair programming with another person sitting next to you. I don't want to ship a PR or even a branch off to someone sitting next to me. I want to discuss and type together in real time.
I’d like
- agent to consolidate simple non-conflicting PRs
- faster previews and CI tests (Render currently)
- detect and suggest solutions for merge conflicts
Codex web doesn't update the PR, which is also something to change (maybe a setting), but for web code agents I'd like the PR, once opened, to stay open.
Also, PRs need an overhaul in general. I create lots of speculative agents; if I like the solution I merge it, which leads to lots of PRs.
Plus, they generate so much noise with all the extra commits and comments that go to everyone in Slack and email rather than just me.
[1] https://ona.com/
I want to run a prompt that operates in an isolated environment that is open in my IDE where I can iterate with the AI. I think maybe it can do this?
Also it isn't always about editing. It is about seeing the surrounding code, navigating around, and ensuring the AI did the right thing in all of the right places.
Each agent has its own isolated container. With Pairing Mode, you can sync the agent's code and git state directly into your local Cursor/any IDE so you can instantly validate its work. The sync is bidirectional so your local changes flow back to the agent in realtime.
Happy to answer any questions - I think you'll really like the tight feedback loop :)
Seeing comments like this all over the place. I switched to CC from Cursor in June/July because I saw the same types of comments. I switched from VSCode + Copilot about 8 months before that for the same reason. I remember suspecting that this sort of thing was guerrilla marketing, but CC was in fact better than Cursor. Guess I'll try Codex, and I guess it's good that there are multiple competing products making big strides.
Never would have imagined myself ditching IDEs and workflows 3x in a few months. A little exhausting
I use CC and Codex somewhat interchangeably, but I have to agree with the comments. Codex is a complete monster, and there really isn't any competition right now.
[0] https://github.com/slopus/happy/
It can run in a front-end only mode (I'll put up a hosted version soon), and then you need to specify your OpenCode API server and it'll connect to it. Alternatively, it can spin up the API server itself and proxy it, and then you just need to expose (securely) the server to the internet.
The UI is responsive, and my main idea was that I can easily continue directing the AI from my phone, but it's also of course possible to just spin up new sessions. So often I have an idea while I'm away from my keyboard, and being able to just say "create an X" and let it do its thing while I'm on the go is quite exciting.
It doesn't spin up a special sandbox environment or anything like that, but you're really free to run it inside whatever sandboxing solution you want. And unlike Claude Code, you're of course free to choose whatever model you want.
[0] https://github.com/bjesus/opencode-web
With Happy, I managed to turn one of these Claude Code instances into a replacement for Claude that has all the MCP goodness I could ever want and more.
[0]: https://happy.engineering/