All these new tools are so exciting, but running untrusted code that auto-updates itself is what's blocking me from trying them.
I wish for a vetting tool: have an LLM examine the code, then write a spec of what it reads and writes, so you can examine that before running it. If something in the list is suspect, you'll know before you're hosed, not after :)
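A minimal sketch of what that vetting pass could look like, assuming the Anthropic Python SDK is available; the model alias and the prompt wording are illustrative assumptions, not an actual product:

```python
# Hypothetical vetting pass: ask a model to describe a script's reads/writes
# before you ever execute it. Review the report yourself; do not auto-trust it.
import sys
import anthropic  # assumes ANTHROPIC_API_KEY is set in the environment

def describe_io(path: str) -> str:
    source = open(path, encoding="utf-8", errors="replace").read()
    prompt = (
        "List every file path, environment variable, and network endpoint "
        "this code reads or writes, and flag anything that looks like "
        "exfiltration or self-updating behaviour.\n\n" + source
    )
    client = anthropic.Anthropic()
    reply = client.messages.create(
        model="claude-3-5-sonnet-latest",  # illustrative model alias
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.content[0].text

if __name__ == "__main__":
    print(describe_io(sys.argv[1]))  # read this *before* running the script
```

As the replies below point out, prompt injection inside the code under review can defeat this, so treat the report as a hint, not a guarantee.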
Not a professional developer (though Guillermo certainly is) so take this with a huge grain of salt, but I like the idea of an AI "trained" on security vulnerabilities as a second, third and fourth set of eyes!
While I agree with the idea of vetting things, I too get a chuckle when folks jump straight from "we can't trust this unknown code" to "let's trust AI to vet it for us". Done it myself.
That's cool and all, until you get malicious code that includes prompt injections, and code that never runs but looks super legit.
LLMs are NOT THOROUGH. Not even remotely. I don't understand how anyone can use LLMs and not see this instantly. I have yet to see an LLM manage a failure rate better than around 50% in the real world, with real-world expectations.
Especially with code review, LLMs catch some things, miss a lot of things, and get a lot of things completely and utterly wrong. It takes someone wholly incompetent at code review to look at an LLM review and go "perfect!".
Edit: Feel free to write a comment if you disagree
They work better in small, well-commented code bases in popular languages. The further you stray from that the less successful they are. That's on top of the quality of your prompt, of course.
> I don't understand how anyone can use LLMs and not see this instantly
Because people in general are not thorough. I've been playing around with Claude Code and, before that, Cursor. Both are great tools when targeted correctly. But I've also tried "vibe" coding with them, and it's obvious where people get fooled: they build a really nice-looking shell of a product that appears to be working, but then you step past the surface layer and issues start to show. Most people don't look past the surface layer, and instead keep digging in, having the agent build on the crappy foundation, until some time later it all falls apart. (And since a lot of these people aren't developers, they have also never heard of source control.)
If you know that LLMs are not thorough going into it, then you can get your failure rates way lower than 50%. Of course if you just paste a product spec into an LLM, it will do a bad job.
If you build an intuition for what kinds of asks an LLM (agent, really) can do well, you can choose to only give it those tasks, and that's where the huge speedups come from.
Don't know what to do about prompt injection, really. But "untrusted code" in the broader sense has always been a risk. If I download and use a library, the author already has free rein of my computer - they don't even need to think about messing with my LLM assistant.
My suggestion is to try CC, use a language like Go, and read their blog posts on how they use it internally. They're transparent about what works and what doesn't.
You can always chroot the directory you're using to isolate the tools from the rest of your system. That is, unless you're using a toy operating system, of course. ;)
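For what it's worth, a rough sketch of that idea in Python (Linux, needs root; the jail path and command are made-up examples), using `os.chroot` in the child just before exec:

```python
# Minimal chroot jail sketch: the child process only sees files under jail_dir.
# Requires root, and the binary plus everything it links against must already
# live inside the jail. A real sandbox would also restrict network access, etc.
import os
import subprocess

def run_jailed(cmd: list[str], jail_dir: str) -> int:
    def enter_jail() -> None:          # runs in the child, right before exec
        os.chroot(jail_dir)
        os.chdir("/")
    return subprocess.run(cmd, preexec_fn=enter_jail).returncode

if __name__ == "__main__":
    run_jailed(["/bin/some-agent"], "/srv/agent-jail")  # hypothetical paths
```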
This is what got me started with claude-code. I gave it a try using the OpenRouter API and got a bill of $40 for 2-3 hours of work. At that point, a subscription to the Anthropic plan became a no-brainer.
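Back-of-the-envelope version of that math; the flat-plan price below is an assumed placeholder, since plans and prices change:

```python
# Extrapolating the "$40 for ~2.5 hours" figure above to a month of daily use.
hourly_api_cost = 40 / 2.5                    # ~16 $/hour pay-as-you-go
monthly_api_cost = hourly_api_cost * 8 * 21   # 8 h/day over 21 workdays
assumed_flat_plan = 200                       # $/month, placeholder only
print(round(monthly_api_cost), assumed_flat_plan)  # 2688 vs 200
```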
I tried quite a few of them, including the cheap/free models, but the only one that really worked was Claude. The others would hang whenever the model needed a confirmation for an action. Mind you, this was some time ago.
The agentic instructions just seem to be better. It does stuff by default (such as working up a plan of action) that other agents need to be prompted for, and it seems to get stuck less in failure sinks. The actual Claude model is decent, but claude code is probably the best agentic tool out there right now.
tbh, claude code is the only product that feels like it's made by people who have actually used AI tooling on legacy codebases
for pretty much every other tool i've used, you walk away from it with the overwhelming feeling that whoever made this has never actually worked at a company in a software engineering team before
i realize this isn't an answer with satisfactory evidence-based language. but I do believe that there's a core `product-focus` difference between claude and other tools
Claude's edge comes from its superior context handling (up to 200K tokens), better tool use capabilities, and constitutional AI training that reduces hallucinations in code generation.
In my case, I ended up accruing $100/day w/ Claude Code (on github workflows) so Max x20 was an easy decision.
Pro seems targeted at a very different use case. Personally, I’ve never used the chat enough to break even. But someone who uses it several times per day might.
ETA: I get that the benefits transfer between the two, just with different limits. I still think it’s pretty clear which kind of usage each plan is intended for.
I tried installing and setting up the project today, and it was miserable. I finally got it to work, only to find out that the Mistral models' tool calling does not work at all for claude code. Also, there is no mention anywhere of which models actually support Anthropic-level tool calling. If anyone knows of open-weight models (DeepSeek or others) I can host on my infra to get this to work out of the box, that would be amazing.
Unfortunately, I haven’t been able to use this with many of the recent open weight code/instruct models - CC tool use doesn’t work with Qwen3 and Kimi K2 for me.
Anyone care to compare the current Aider with Claude Code? I tried Aider 6+ months ago and liked it but haven't tried it more recently because Claude Code is working so well for me. But I keep feeling like I should try Aider again.
Aider is good at one-shotting Git commits, but requires a human in the loop for a lot of iteration. Claude Code is better at iterating on problems that take multiple tries to get right (which is most problems IMO). I was really impressed by Aider until I started using CC.
I moved from Aider to Claude Code for the simple reason that I usually use IntelliJ IDEA, and even if it's poorer than RooCode on VSCode, the integration between IntelliJ and Claude Code is reasonably solid.
That said, today I started using CCR, since the possibility of using different models is extremely interesting (and the reason why I initially used Aider).
If the first llm wasn’t enough, the second won’t be either. You’re in the wrong layer.
Most of these tools are not that exciting. They are similar-looking TUIs around third-party models/LLM calls.
What is the difference between this, and https://opencode.ai? Or any of the half a dozen tools that appeared on HN in the past few weeks?
Could work I think (be wary of sending .env to the web though)
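One way to respect that .env caution if you do ship code off to a hosted model: filter secret-looking files out of whatever you collect for review. The skip-list here is an illustrative assumption, not a complete denylist:

```python
# Gather files for an off-machine review pass while skipping obvious secrets.
from pathlib import Path

SKIP_NAMES = {".env", ".env.local", "id_rsa", "credentials.json"}  # assumed list

def reviewable_files(root: str) -> list[Path]:
    return [
        p for p in Path(root).rglob("*")
        if p.is_file() and p.name not in SKIP_NAMES and ".git" not in p.parts
    ]

for path in reviewable_files("."):
    print(path)  # only these would leave the machine
```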
Is it just better prompting? Better tooling?
With LLMs, we may finally get a much more efficient way to deal with devs doing lock-in via ultra-complex-syntax languages.
I already have some ideas for some target C++ code to port to C99+.
1: https://aider.chat/
Compare with https://github.com/sst/opencode/pulse/monthly
Ofc some might prefer the pure CLI experience, but I'm mentioning it because it also supports a lot of providers.