OpenAI Codex CLI: Lightweight coding agent that runs in your terminal

I tried one task head-to-head with Codex o4-mini vs Claude Code: writing documentation for a tricky area of a medium-sized codebase.

Claude Code did great and wrote pretty decent docs.

Codex didn't do well. It hallucinated a bunch of stuff that wasn't in the code, and completely misrepresented the architecture - it started talking about server backends and REST APIs in an app that doesn't have any of that.

I'm curious what went so wrong - feels like possibly an issue with loading in the right context and attending to it correctly? That seems like an area that Claude Code has really optimized for.

I have high hopes for o3 and o4-mini as models so I hope that other tests show better results! Also curious to see how Cursor etc. incorporate o3.

strangescript · 10 months ago

Claude Code still feels superior. o4-mini has all sorts of issues. o3 is better but at that point, you aren't saving money so who cares.

I feel like people are sleeping on Claude Code for one reason or another. Its not cheap, but its by far the best, most consistent experience I have had.

artdigital · 10 months ago

Claude Code is just way too expensive.

These days I’m using Amazon Q Pro on the CLI. Very similar experience to Claude Code minus a few batteries. But it’s capped at $20/mo and won’t set my credit card on fire.

ekabod · 10 months ago

"gemini 2.5 pro exp" is superior to Claude Sonnet 3.7 when I use it with Aider [1]. And it is free (with some high limit).

[1]https://aider.chat/

Aeolun · 10 months ago

> Its not cheap, but its by far the best, most consistent experience I have had.

It’s too expensive for what it does though. And it starts failing rapidly when it exhausts the context window.

ilaksh · 10 months ago

Did you try the same exact test with o3 instead? The mini models are meant for speed.

gklitt · 10 months ago

I want to but I’ve been having trouble getting o3 to work - lots of errors related to model selection.

ksec · 10 months ago

Sometimes I see in certain areas AI / LLM is absolutely crushing those jobs, a whole category will be gone in next 5 to 10 years as they are already 80 - 90% mark. They just need another 5 - 10% as they continue to get improvement and they are already cheaper per task.

Sometimes I see an area of AI/LLM that I thought even with 10x efficiency improvement and 10x hardware resources which is 100x in aggregate it will still be no where near good enough.

The truth is probably somewhere in the middle. Which is why I dont believe AGI will be here any time soon. But Assisted Intelligence is no doubt in its iPhone moment and continue for another 10 years before hopefully another breakthrough.

enether · 10 months ago

there was one post that detailed how those OpenAI models hallucinate and double down on thier mistakes by "lying" - it speculated on a bunch of interesting reasons why this may be the case

I wonder if this is what's causing it to do badly in these cases

victor9000 · 10 months ago

> I no longer have the “real” prime I generated during that earlier session... I produced it in a throw‑away Python process, verified it, copied it to the clipboard, and then closed the interpreter.

AGI may well be on its way, as the mode is mastering the fine art of bullshitting.

Deleted Comment

kristopolous · 10 months ago

Ever use Komment? They've been in the game a awhile. Looks pretty good

Fingers crossed for this to work well! Claude Code is pretty excellent.

I’m actually legitimately surprised how good it is, since other coding agents I’ve used before have mostly been a letdown, which made me only use Claude in direct change prompting with Zed (“implement xyz here”, “rewrite this function with abc”, etc), so very hands-on.

So I’ve went into trying out Claude Code rather pessimistically, and now I’m using it all the time! Sure, it ends up costing a bunch, but it’s easy to justify $15 for a prompting session if the end result is a mostly complete PR, done much faster.

All that is to say - competition is good, fingers crossed for codex!

therealmarv · 10 months ago

Claude Code has a closed license https://github.com/anthropics/claude-code/blob/main/LICENSE....

There is fork named Anon Kode https://github.com/dnakov/anon-kode which can use more models and non-Anthropic ones. But the license is unclear for it.

It's interesting to see codex to be Apache License. Maybe somebody extends it to be usable with competing models.

WatchDog · 10 months ago

If it's a fork of the proprietary code, the license is pretty clear, it's violating copyright.

Now whether or not anthropic care enough to enforce their license is separate issue, but it seems unwise to make much of an investment in it.

cube2222 · 10 months ago

In terms of terminal-based and open-source, I think aider is the most popular one.

jwr · 10 months ago

Seconded. I was surprised by how good Claude Code is, even for less mainstream languages (Clojure). I am happy there is competition!

kurtis_reed · 10 months ago

Fingers crossed for what?

dzhiurgis · 10 months ago

I started using claude code everyday. It’s kinda expensive and hallucinates a ton (tho with custom prompt i’ve mostly tamed it).

Hope more competition can bring price down.

retinaros · 10 months ago

too expensive. I cant understand why everyone is into claude code vs using claude in cursor or windsurf.

danenania · 10 months ago

I think it depends a lot on how you value your time. I'm personally willing to spend hundreds or thousands per month happily if it saves me enough hours. I'd estimate that if I were to do consulting, I'd likely be charging in the $150-250 per hour range, so by my math, it's pretty easy to justify any tools that save me even a few hours per month.

ChadMoran · 10 months ago

Claude Code has been able to produce results equivalent to a junior engineer. I spent about $300 API credits in a month but got the value out of it far surpassing that.

benzible · 10 months ago

If you have AWS credits...

export CLAUDE_CODE_USE_BEDROCK=1

export ANTHROPIC_MODEL=us.anthropic.claude-3-7-sonnet-20250219-v1:0

export ANTHROPIC_API_TYPE=bedrock

_neil · 10 months ago

Anecdotally, Claude code performs much better than Claude within Cursor. Not sure if it’s a system prompt thing or if I’ve just convinced myself of it because the aesthetic is so much better, but either way the end result feels better to me.

drusepth · 10 months ago

I tried switching from Claude Code to both Cursor and Windsurf. Neither of the latter IDEs fully support MCP implementations (missing basic things like tool definitions and other vital features last time I tried) and both have been riddled with their own agentic flow issues (cursor going down for a week a bit ago, windsurf requiring paid upgrades to "get around" bugs, etc).

This is all ignoring the controversies that pop up around e.g. Cursor seemingly every week. As an IDE, they're both getting there -- but I have objectively better results in Claude Code.

tcdent · 10 months ago

that's what my Ramp card is for.

seriously though, anything that makes me smarter and more productive has a threshold in the thousands-of-dollars range, not hundreds

newlisp · 10 months ago

Why is using cursor with sonnet cheaper than using claude code?

swyx · 10 months ago

related demo/intro video: https://x.com/OpenAIDevs/status/1912556874211422572

this is a direct answer to claude code which has been shipping furiously: https://x.com/_catwu/status/1903130881205977320

and is not open source; there are unverified comments that they have DMCA'ed decompilations https://x.com/vikhyatk/status/1899997417736724858?s=46

by total coincidence we're releasing our claude code interview later this week that touches on a lot of these points + why code agent CLIs are an actually underrated point in the SWE design space

(TLDR you can use it like a linux utility - similar to @simonw's `llm` - to sprinkle intelligence in all sorts of things like CI/PR review without the overhead of buying a Devin or a Copilot SaaS)

if you are a Claude Code (and now OAI Codex) power user we want to hear use cases - CFP closing soon, apply here https://sessionize.com/ai-engineer-worlds-fair-2025

axkdev · 10 months ago

Hey! The weakest part of Claude Code I think is that it's closed source and locked to Claude models only. If you are looking for inspiration, Roo is the the best tool atm. It offers far more interesting capabilities. Just to name some - user defines modes, the built in debug mode is great for debugging, architecture mode. You can, for example, ask it to summarize some part of the running task and start a new task with fresh context. And, unlike in Claude Code, in Roo the LLM will actually follow your custom instructions (seriously, guys, that Claude.md is absolutely useless)! The only drawback of Roo, in my opinion, is that it is NOT a cli.

there's goose. plandex and aider also there's kilo as a new fork of roo.

senko · 10 months ago

I got confused, so to clarify to myself and others - codex is open source, claude code isn't, and the referenced decompilation tweets are for claude code.

asadm · 10 months ago

These days, I usually paste my entire (or some) repo into gemini and then APPLY changes back into my code using this handy script i wrote: https://github.com/asadm/vibemode

I have tried aider/copilot/continue/etc. But they lack in one way or the other.

jwpapi · 10 months ago

It’s not just about saving money or making less mistakes its also about iteration speed. I can’t believe this process is remotely comparable to aider.

In aider everything is loaded in memory I can add drop files in terminal, discuss in terminal, switch models, every change is a commit, run terminal commands with ! at the start.

Full codebase is more expensive and slower than relevant files. I understand when you don’t worry about the cost, but at reasonable size pasting full codebase can’t be really a thing.

I am at my 5th project in this workflow and these are of different types too:

- an embedded project for esp32 (100k tokens)

- visual inertial odometry algorithm (200k+ tokens)

- a web app (60k tokens)

- the tool itself mentioned above (~30k tokens)

it has worked well enough for me. Other methods have not.

t1amat · 10 months ago

Use a tool like repomix (npm), which has extensions in some editors (at least VSCode) that can quickly bundle source files into a machine readable format

brandall10 · 10 months ago

Why not just select Gemini Pro 2.5 in Copilot with Edit mode? Virtually unlimited use without extra fees.

Copilot used to be useless, but over the last few months has become quite excellent once edit mode was added.

copilot (and others) try to be too smart and do context reduction (to save their own wallets). I want ENTIRETY of the files I attached to context, not RAG-ed version of it.

fasdfasdf11234 · 10 months ago

Isn't this similar to https://aider.chat/docs/usage/copypaste.html

Just checked to see how it works. It seems that it does all that you are describing. The difference is in the way that it provides the files - it doesn't use xml format.

If you wish you could /add * to add all your files.

Also deducing from this mode it seems that any file that you add to aider chat with /add has its full contents added to the chat context.

But hey I might be wrong. Did a limited test with 3 files in project.

that’s correct. aider doesn’t RAG on files which is good. I don’t use it because 1) UI is so slow and clunky 2) using gemini 2.5 via api in this way (huge context window) is expensive but also heavily rate limited at this point. No such issue when used via aistudio ui.

ramraj07 · 10 months ago

I felt it loses track of things on really large codebases. I use 16x prompt to choose the appropriate files for my question and let it generate the prompt.

do you mean gemini? I generally notice pretty great recall UPTO 200k tokens. It's ~OK after that.

gizmodo59 · 10 months ago

This is pretty neat! I was able to use it for few use cases where it got it right the first time. The ability to use a screenshot to create an application is nice for rapid prototyping. And good to see them open sourcing it unlike claude.

kumarm · 10 months ago

First experience is not great. Here are the issues to start using codex:

1. Default model used doesn't work and you get error: system OpenAI rejected the request (request ID: req_06727eaf1c5d1e3f900760d10ca565a7). Please verify your settings and try again.

2. You have to switch to model o4-mini-2025-04-16 or some other model using /model. Now if you exit codex, you are back to default model and again have to switch everytime.

3. Crashed the first time with NodeJS error.

But after initial hickups seems to work and still checking how good/bad it is compared to claude code (which I love except for context size limits)

shekhargulati · 10 months ago

Not sure why they used React for a CLI. The code in the repo feels like it was written by an LLM—too many inline comments. Interestingly, their agent's system prompt mentions removing inline comments https://github.com/openai/codex/blob/main/codex-cli/src/util....

> - Remove all inline comments you added as much as possible, even if they look normal. Check using \`git diff\`. Inline comments must be generally avoided, unless active maintainers of the repo, after long careful study of the code and the issue, will still misinterpret the code without the comments.

kristianp · 10 months ago

I find it irritating too when companies use react for a command line utility. I think its just my preference for anything else but javascript.

mgdev · 10 months ago

Strictly worse than Claude Code presently, but I hope since it's open source that changes quickly.

killerstorm · 10 months ago

Given that Claude Code only works with Sonnet 3.7 which has severe limitations, how can it be "strictly worse"?

Whatever Claude Code is doing in the client/prompting is making much better use of 3.7 than any other client I'm using that also uses 3.7. This is especially true for when you bump up against context limits; it can successfully resume with a context reset about 90% of the time. MCP Commander [0] was built almost 100% using Claude Code and pretty light intervention. I immediately felt the difference in friction when using Codex.

I also spent a couple hours picking apart Codex with the goal of adding Sonnet 3.7 support (almost there). The actual agent loop they're using is very simple. Not to say that's a bad thing, but they're offloading all planning and workflow execution to the agent itself. That's probably the right end state to shoot for long-term, but given the current state of these models I've had much better success offloading task tracking to some other thing - even if that thing is just a markdown checklist. (I wrote about my experience [1] building AI Agents last year.)

[0]: https://mcpcommander.com/

[1]: https://mg.dev/lessons-learned-building-ai-agents/