I tried one task head-to-head with Codex o4-mini vs Claude Code: writing documentation for a tricky area of a medium-sized codebase.
Claude Code did great and wrote pretty decent docs.
Codex didn't do well. It hallucinated a bunch of stuff that wasn't in the code, and completely misrepresented the architecture - it started talking about server backends and REST APIs in an app that doesn't have any of that.
I'm curious what went so wrong - feels like possibly an issue with loading in the right context and attending to it correctly? That seems like an area that Claude Code has really optimized for.
I have high hopes for o3 and o4-mini as models so I hope that other tests show better results! Also curious to see how Cursor etc. incorporate o3.
Claude Code still feels superior. o4-mini has all sorts of issues. o3 is better but at that point, you aren't saving money so who cares.
I feel like people are sleeping on Claude Code for one reason or another. It's not cheap, but it's by far the best, most consistent experience I have had.
These days I’m using Amazon Q Pro on the CLI. Very similar experience to Claude Code minus a few batteries. But it’s capped at $20/mo and won’t set my credit card on fire.
Sometimes I look at certain areas where AI/LLMs are absolutely crushing the work: a whole job category will be gone in the next 5 to 10 years, because they're already at the 80-90% mark, are already cheaper per task, and just need another 5-10% as they continue to improve.
Other times I look at an area of AI/LLMs where I think that even with a 10x efficiency improvement and 10x the hardware resources - 100x in aggregate - it will still be nowhere near good enough.
The truth is probably somewhere in the middle, which is why I don't believe AGI will be here any time soon. But Assisted Intelligence is no doubt having its iPhone moment, and that will continue for another 10 years before, hopefully, another breakthrough.
there was one post that detailed how those OpenAI models hallucinate and double down on their mistakes by "lying" - it speculated on a bunch of interesting reasons why this may be the case
> I no longer have the “real” prime I generated during that earlier session... I produced it in a throw‑away Python process, verified it, copied it to the clipboard, and then closed the interpreter.
AGI may well be on its way, as the model is mastering the fine art of bullshitting.
by total coincidence we're releasing our claude code interview later this week that touches on a lot of these points + why code agent CLIs are an actually underrated point in the SWE design space
(TLDR you can use it like a linux utility - similar to @simonw's `llm` - to sprinkle intelligence in all sorts of things like CI/PR review without the overhead of buying a Devin or a Copilot SaaS)
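For instance, a minimal sketch of that utility-style usage (the -p print flag and stdin piping here are assumptions based on Claude Code's non-interactive mode; check your version's docs):

git diff origin/main...HEAD | claude -p "Review this diff: flag bugs, risky changes, and missing tests"   # one-shot PR review inside CI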
Hey! The weakest part of Claude Code, I think, is that it's closed source and locked to Claude models only. If you are looking for inspiration, Roo is the best tool atm. It offers far more interesting capabilities. Just to name some: user-defined modes, a built-in debug mode that is great for debugging, and an architecture mode. You can, for example, ask it to summarize some part of the running task and start a new task with fresh context. And, unlike in Claude Code, in Roo the LLM will actually follow your custom instructions (seriously, guys, that Claude.md is absolutely useless)! The only drawback of Roo, in my opinion, is that it is NOT a CLI.
I got confused, so to clarify to myself and others - codex is open source, claude code isn't, and the referenced decompilation tweets are for claude code.
These days, I usually paste my entire (or some) repo into gemini and then APPLY changes back into my code using this handy script i wrote: https://github.com/asadm/vibemode
I have tried aider/copilot/continue/etc. But they lack in one way or the other.
It's not just about saving money or making fewer mistakes; it's also about iteration speed. I can't believe this process is remotely comparable to aider.
In aider everything is loaded in memory: I can add/drop files in the terminal, discuss in the terminal, switch models, and run terminal commands with ! at the start, and every change is a commit.
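For anyone who hasn't tried it, an aider session looks roughly like this (commands are from aider's docs, the file names are made up, and details may vary by version):

/add src/app.py tests/test_app.py   # load these files fully into the chat context
/model gpt-4o                       # switch models mid-session
!pytest -q                          # run a shell command without leaving aider
/drop tests/test_app.py             # drop a file from context; accepted edits land as git commits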
The full codebase is more expensive and slower than just the relevant files. I understand if you don't worry about the cost, but at a reasonable size, pasting the full codebase can't really be a thing.
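Rough arithmetic (illustrative rates only, not any provider's actual pricing):

# full repo:       200,000 tokens x $1.50 per 1M input tokens ≈ $0.30 per request
# relevant files:   10,000 tokens x $1.50 per 1M input tokens ≈ $0.015 per request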
Use a tool like repomix (npm), which has extensions in some editors (at least VSCode), to quickly bundle source files into a machine-readable format.
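A minimal sketch of the repomix flow (assuming its default behavior of writing one bundled file; flag names may differ across versions):

npx repomix                      # bundle the current repo into repomix-output.xml
npx repomix --style markdown     # or emit a markdown bundle instead of XML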
Copilot (and others) try to be too smart and do context reduction (to save their own wallets). I want the ENTIRETY of the files I attached in context, not a RAG-ed version of them.
Just checked to see how it works. It seems that it does all that you are describing. The difference is in the way that it provides the files - it doesn't use xml format.
If you wish you could /add * to add all your files.
Also deducing from this mode it seems that any file that you add to aider chat with /add has its full contents added to the chat context.
But hey I might be wrong. Did a limited test with 3 files in project.
That's correct: aider doesn't RAG on files, which is good. I don't use it because 1) the UI is so slow and clunky, and 2) using Gemini 2.5 via the API in this way (huge context window) is expensive and also heavily rate limited at this point. No such issue when used via the AI Studio UI.
I felt it loses track of things on really large codebases. I use 16x Prompt to choose the appropriate files for my question and let it generate the prompt.
Fingers crossed for this to work well! Claude Code is pretty excellent.
I’m actually legitimately surprised how good it is, since other coding agents I’ve used before have mostly been a letdown, which made me only use Claude in direct change prompting with Zed (“implement xyz here”, “rewrite this function with abc”, etc), so very hands-on.
So I went into trying out Claude Code rather pessimistically, and now I'm using it all the time! Sure, it ends up costing a bunch, but it's easy to justify $15 for a prompting session if the end result is a mostly complete PR, done much faster.
All that is to say - competition is good, fingers crossed for codex!
I think it depends a lot on how you value your time. I'm personally willing to spend hundreds or thousands per month happily if it saves me enough hours. I'd estimate that if I were to do consulting, I'd likely be charging in the $150-250 per hour range, so by my math, it's pretty easy to justify any tools that save me even a few hours per month.
Claude Code has been able to produce results equivalent to a junior engineer's. I spent about $300 in API credits in a month but got value out of it far surpassing that.
Anecdotally, Claude Code performs much better than Claude within Cursor. Not sure if it's a system prompt thing or if I've just convinced myself of it because the aesthetic is so much better, but either way the end result feels better to me.
I tried switching from Claude Code to both Cursor and Windsurf. Neither of the latter IDEs fully supports MCP (both were missing basic things like tool definitions and other vital features last time I tried), and both have been riddled with their own agentic-flow issues (Cursor going down for a week a bit ago, Windsurf requiring paid upgrades to "get around" bugs, etc.).
This is all ignoring the controversies that pop up around e.g. Cursor seemingly every week. As an IDE, they're both getting there -- but I have objectively better results in Claude Code.
This is pretty neat! I was able to use it for a few use cases where it got it right the first time. The ability to use a screenshot to create an application is nice for rapid prototyping. And it's good to see them open sourcing it, unlike Claude Code.
First experience is not great. Here are the issues I hit getting started with Codex:
1. The default model doesn't work and you get an error:
   system: OpenAI rejected the request (request ID: req_06727eaf1c5d1e3f900760d10ca565a7). Please verify your settings and try again.
2. You have to switch to o4-mini-2025-04-16 or some other model using /model. And if you exit Codex, you are back to the default model and have to switch again every time (a possible workaround is sketched below).
3. It crashed the first time with a NodeJS error.
But after the initial hiccups it seems to work, and I'm still checking how good/bad it is compared to Claude Code (which I love, except for the context size limits).
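A possible workaround for issue 2, assuming the --model flag and the ~/.codex config file behave as the repo's README describes (I haven't verified every build):

codex --model o4-mini "explain this codebase to me"   # pass the model per invocation
echo "model: o4-mini" > ~/.codex/config.yaml          # or pin a default so /model isn't needed each run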
Not sure why they used React for a CLI. The code in the repo feels like it was written by an LLM—too many inline comments. Interestingly, their agent's system prompt mentions removing inline comments https://github.com/openai/codex/blob/main/codex-cli/src/util....
> - Remove all inline comments you added as much as possible, even if they look normal. Check using `git diff`. Inline comments must be generally avoided, unless active maintainers of the repo, after long careful study of the code and the issue, will still misinterpret the code without the comments.
Whatever Claude Code is doing in the client/prompting is making much better use of 3.7 than any other client I'm using that also uses 3.7. This is especially true for when you bump up against context limits; it can successfully resume with a context reset about 90% of the time. MCP Commander [0] was built almost 100% using Claude Code and pretty light intervention. I immediately felt the difference in friction when using Codex.
I also spent a couple hours picking apart Codex with the goal of adding Sonnet 3.7 support (almost there). The actual agent loop they're using is very simple. Not to say that's a bad thing, but they're offloading all planning and workflow execution to the agent itself. That's probably the right end state to shoot for long-term, but given the current state of these models I've had much better success offloading task tracking to some other thing - even if that thing is just a markdown checklist. (I wrote about my experience [1] building AI Agents last year.)
[1] https://aider.chat/
It’s too expensive for what it does though. And it starts failing rapidly when it exhausts the context window.
recommended read - https://transluce.org/investigating-o3-truthfulness
I wonder if this is what's causing it to do badly in these cases
this is a direct answer to claude code which has been shipping furiously: https://x.com/_catwu/status/1903130881205977320
and is not open source; there are unverified comments that they have DMCA'ed decompilations https://x.com/vikhyatk/status/1899997417736724858?s=46
if you are a Claude Code (and now OAI Codex) power user we want to hear use cases - CFP closing soon, apply here https://sessionize.com/ai-engineer-worlds-fair-2025
For reference, I've pasted entire repos into Gemini this way for:
- an embedded project for esp32 (100k tokens)
- visual inertial odometry algorithm (200k+ tokens)
- a web app (60k tokens)
- the tool itself mentioned above (~30k tokens)
it has worked well enough for me. Other methods have not.
Copilot used to be useless, but over the last few months has become quite excellent once edit mode was added.
There is a fork named Anon Kode (https://github.com/dnakov/anon-kode) which can use more models, including non-Anthropic ones. But its license is unclear.
It's interesting to see Codex under the Apache License. Maybe somebody will extend it to be usable with competing models.
Now, whether or not Anthropic cares enough to enforce their license is a separate issue, but it seems unwise to make much of an investment in it.
Hope more competition can bring price down.
# route Claude Code through AWS Bedrock instead of the Anthropic API
export CLAUDE_CODE_USE_BEDROCK=1
export ANTHROPIC_MODEL=us.anthropic.claude-3-7-sonnet-20250219-v1:0
export ANTHROPIC_API_TYPE=bedrock
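With those variables exported, and assuming AWS credentials for a Bedrock-enabled region are already configured, starting the CLI should route requests through Bedrock (a sketch, not verified on every setup):

aws sts get-caller-identity   # sanity-check that AWS credentials are available
claude                        # launch Claude Code; requests now go via Bedrock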
seriously though, anything that makes me smarter and more productive has a threshold in the thousands-of-dollars range, not hundreds
[0]: https://mcpcommander.com/
[1]: https://mg.dev/lessons-learned-building-ai-agents/