> What other tasks could be automated today with current LLMs' performance?
CEO speeches and pro-LLM blogs come to mind.
Again, there is a vague focus on "updating dependencies" where allegedly some time was saved. Take that to the extreme and we don't need any new software. Freeze Linux and Windows, do only security updates and fire everyone. Because the ultimate goal of LLM shills or self-hating programmers appears to be to eliminate all redundant work.
Be careful what you wish for. They won't reward you for shilling or automating, they'll just fire you.
This. They've been pushing these at my workplace and the only thing I can think to use it for is to have the LLMs generate empty long-winded corporate-speak emails that I can send to managers when they ask for things that seem best answered by an empty long-winded corporate-speak email. Like "How are you using all these AI tools we are forcing on you without asking if you needed or wanted them?"
This feels a bit too optimistic, in practice it often gets stuck going down a rabbit hole (and burning up your requests / tokens doing it!).
Like even when I tested it on a clean assessment (albeit with Cursor in this case) - https://jamesmcm.github.io/blog/claude-data-engineer/ - it did very well in agent mode, but the questions it got wrong were worrying because they're the sort of things that a human might not notice either.
That said I do think you could get a lot more accuracy by having the agent check and run its own answers, and then also sending its diff to a very strong LLM like o3 or Gemini Pro 2.5 to review it - it's just a bit expensive to do that atm.
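For what it's worth, the diff-review step doesn't need much machinery. A minimal sketch in Python, assuming the openai package is installed and an OPENAI_API_KEY is set; the model name and prompt wording are just placeholders, not a recommendation:

```python
# review_diff.py - rough sketch: pipe the working-tree diff to a strong model for review.
# Assumes the `openai` Python package and an OPENAI_API_KEY environment variable;
# the model name and prompt wording are placeholders.
import subprocess
from openai import OpenAI


def review_current_diff(model: str = "o3") -> str:
    # Grab whatever the agent changed in the working tree.
    diff = subprocess.run(
        ["git", "diff"], capture_output=True, text=True, check=True
    ).stdout
    if not diff.strip():
        return "Nothing to review: working tree is clean."

    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": (
                "Act as a strict code reviewer. Point out bugs, missed edge "
                "cases and risky changes in the following diff:\n\n" + diff
            ),
        }],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    print(review_current_diff())
```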
The main issue on real projects is that just gathering enough context to even approach problems, and being able to build and run the tests, is very difficult when you have 100k+ lines of code and a clean build plus test run takes 15 minutes. And it feels like we're still years away from having all of the above, plus a large enough context window that this is a non-issue, for a reasonable price.
Like, it's a nerd slot machine: shows you small wins, gets you almost-big wins and seduces you into thinking "just one more perfect prompt and surely I'll hit the jackpot"
I really enjoyed Claude Code. I was using it on some side projects for about a month with API credits, and I signed up for the Max subscription shortly after it started working with Code. Overnight, my account was banned, and I have no idea why.
It sucks getting banned from such a cool and helpful tool :(
I had two accounts banned - one for Claude and one for the API. I tried to appeal both, asking for more information. The response from Anthropic was non-specific, saying only that it violated usage policy. One account had only been minimally used; the other was never used. The accounts used email addresses on a domain I control - e.g., anthropic-claude@domain.xyz. I think that might have something to do with it.
I have a new account now using a Google account and it hasn’t been banned.
Nope, just running and stopping dev servers. It may have done a pkill once or twice if something was hanging?
Either way, using it with the API credits was fine for a little over a month, so I don't know if it was that. I got autobanned only a few hours after paying for Max and reauthing the client to use the subscription. My actual usage of it didn't change.
The recent developments are impressive. I’m now using my IDE as a diff viewer.
Everything goes through the terminal. If there is an error, CC can analyse and fix it.
Still needs a lot of handholding. I do not (yet) think big upfront plans will suddenly start working in the enterprise world. Let it write a failing test first.
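For what it's worth, the "failing test first" part is easy to show concretely. A minimal illustration (the module and function names here are hypothetical): you write or at least review a test like this yourself, watch it fail, then tell CC to make it pass without touching the test file.

```python
# test_duration.py - hypothetical failing test, written before any implementation exists.
# The idea: hand this to Claude Code with "make this pass, do not edit the test file".
# `duration.parse_duration` does not exist yet, so the suite fails on import - by design.
import pytest

from duration import parse_duration


def test_parse_duration_handles_mixed_units():
    assert parse_duration("1h30m") == 5400  # seconds
    assert parse_duration("45s") == 45


def test_parse_duration_rejects_garbage():
    with pytest.raises(ValueError):
        parse_duration("soon")
```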
I'm still not convinced. I spent a few hours today trying to get it to add linting to a SQL repository, _given another repository that already had what I wanted_.
At one point it got a linting error and just added that error to the ignore list. I definitely spent more time reviewing this code and prompting than it would have taken for me to do it myself. And it's still not merged!
Many of the complainers don't know how to use them and how to write prompts, and then blame the LLMs.
Or simply use LLMs that struggle at writing good code (GPT, Gemini Pro, etc).
You need to put yourself in the shoes of a product owner, and be able to express your requirements clearly and drive the LLM in your direction, and this requires learning new skills (like kids learning how to use search engines).
> Or simply use LLMs that struggle at writing good code (GPT, Gemini Pro, etc).
I love how one side of this debate seems to have embraced "No True Scotsman" as the preferred argument strategy. Anyone who points out that these things have practical limitations gets a litany of "oh you aren't using it right" or "oh, you just aren't using the cool model" in response. It reminds me of the hipsters in SF who always felt your music was a little too last week.
As someone who is currently using these every day, Gemini Pro is right up there with the very best models for writing code -- and "GPT" is not a single thing -- so I have no idea what you're talking about. These things have practical limitations.
> Or simply use LLMs that struggle at writing good code (GPT
As still the default model for GitHub Copilot, GPT doesn't seem to "struggle" at all with writing good code. Anecdotally, Claude seems woefully under-trained in comparison with GPT in areas such as PowerShell and cross-platform solutions. (Which also seems to show directly in Claude Code's awful cross-platform support. If Claude is so good, why doesn't it fix Claude Code's Windows support? Add more PowerShell support instead of just bashing out bash-isms?)
A lot of impressions of the LLMs are hugely subjective, and I'm inclined to the above poster's suggestion that a lot of what you get out of an LLM is a reflection of who you are and what you put into the LLM. (They are massively optimized GIGO machines, after all.)
In my experience, using agents has just wasted my time and money. They are good for small things if you are lazy and watching a movie, looking at the results every 10 minutes, reverting and trying again.
I have used it on a fairly simple Kotlin Android application and was blown away. I have previously been using paid ChatGPT, Github Copilot, and Gemini. In my opinion, it's the complete access to your repo that really makes it powerful, whereas with the other plugins you kind of have to manually feed it the files in your workspace and keep them in sync.
I asked it to add Google Play subscription support to my application and it did, it required minimal tweaking.
I asked it to add a screen for requesting location permissions from the user and it did it perfectly. No adjustment.
I also asked it to add a query parameter to my API (GoLang) which should result in a subtle change several layers deep, and it had no problems with that.
None of this is rocket science and I think the key is that it's all been done and documented a million times on the Internet. At this point, Claude Code is at least as effective as a junior developer.
Yes, I understand that this is a Faustian bargain.
It gives us great productivity. If you write the tests yourself and insist it delivers 100% success without touching the tests themselves - it may only run them - it is very nice. We wrote a little bit of tooling around it so it instructs and loops until 100% of the tests succeed. Even for stuff that's complex enough for seniors to struggle with (parsers/compilers), it delivers results after hours instead of days or weeks.

But if you miss some tests, you can all but guarantee that those things won't work, even where an experienced human would automatically get them right because anything else would be illogical. We would write tests like this for humans as well, though, so there is not much difference in our workflow; CC just delivers faster and far, far cheaper.

And we tried it all; especially NOT having it integrated into an IDE is brilliant. Before this we used aider instead of Cursor etc., as we can control it: we don't want a human sitting there tapping 'yes, please do' or whatnot. We want it to finish, commit a PR and then we review.
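The loop itself doesn't need to be clever. A rough sketch of what that kind of tooling could look like (not our actual code), assuming the Claude Code CLI's non-interactive print mode (claude -p) and a pytest suite; the prompt wording and retry cap are arbitrary:

```python
# loop_until_green.py - rough sketch of "instruct and loop until 100% of tests pass".
# Assumes the Claude Code CLI is installed and supports non-interactive mode (claude -p),
# and that the project's tests run via pytest. Prompt wording and retry cap are arbitrary.
import subprocess
import sys

MAX_ITERATIONS = 10


def run_tests() -> subprocess.CompletedProcess:
    # The tests are the source of truth; the agent is never allowed to edit them.
    return subprocess.run(["pytest", "-q"], capture_output=True, text=True)


def main() -> int:
    for attempt in range(1, MAX_ITERATIONS + 1):
        result = run_tests()
        if result.returncode == 0:
            print(f"All tests passing after {attempt - 1} fix round(s).")
            return 0

        failure_log = (result.stdout + result.stderr)[-8000:]  # keep the prompt small
        prompt = (
            "The test suite is failing. Fix the implementation so that all tests pass.\n"
            "You must NOT modify anything under tests/ - only change application code.\n\n"
            f"pytest output:\n{failure_log}"
        )
        print(f"Attempt {attempt}: tests failing, asking Claude Code to fix...")
        subprocess.run(["claude", "-p", prompt], check=False)

    print("Gave up: tests still failing after the retry cap.")
    return 1


if __name__ == "__main__":
    sys.exit(main())
```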
It's great at mocking up some HTML pages with eg Tailwind and static site generators. Give it some ideas, a bit of copy, a few colours and it'll create some pages filled with plausible sounding text. I can imagine using it in front of clients to give them an idea of what a new site could look like.
Easily adjusted with things like "the colour palette is a bit bright, use more pastels" or "make it more SEO friendly" and it often easily generates a large todo list/set of changes based on minimal input
My friend was mulling over a product concept and I used it to design a landing page and it helped her see how easily you can create a website to sell the product. It took ~15 minutes and I'm a web dev noob. (Obviously setting up a real ecommerce site is a little bit more work)
It makes sense it's good at HTML because of the huge body of public data available.
I've been very successful pointing it to a backlog of manual test cases, using Playwright MCP to execute the test cases against dev as a black box, and generating the corresponding Playwright scripts to add to our automated test repo.
I had hired an actual automated tester with years of experience to write Playwright scripts for us. After 3 months he had not produced a single passing test. I managed to build the entire scaffolding myself in 2 weeks, having no prior Playwright experience.
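For a sense of what those generated scripts end up looking like, here's a minimal illustrative example in Playwright's Python flavour (the URL, labels and assertion are made up; a real generated script would mirror the steps and expected result of the corresponding manual test case):

```python
# test_login.py - illustrative example of a generated Playwright check.
# Requires `pip install playwright` and `playwright install chromium`.
# The URL, labels and expected text are made up for illustration only.
from playwright.sync_api import sync_playwright, expect


def test_user_can_log_in():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        page.goto("https://dev.example.com/login")
        page.get_by_label("Email").fill("qa-user@example.com")
        page.get_by_label("Password").fill("not-a-real-password")
        page.get_by_role("button", name="Sign in").click()

        # The manual test case's expected result, expressed as an assertion.
        expect(page.get_by_text("Welcome back")).to_be_visible()

        browser.close()
```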
I use CC in existing code bases to build out new GUIs - VueJS/Quasar - and it blows me away! For back-end Rust code it excels at boilerplate CRUD handlers back to the db - it copies the style of existing code… I’ll happily pay for it if my boss does not, and just work fewer hours…
The productivity gains decrease with user experience. A high-performing senior engineer won't get a lot, but I think they've reached a point now where even seniors will benefit a fair amount. For me it's not really that they increase my productivity directly, but they let me offload a lot of the cognitive load. I'm getting a similar amount of work done and I don't feel as drained at the end of the day.
Usually, I'll go through my coding like I would have pre-LLMs.
Then, when I see something that looks like it can be reliably automated by an AI agent, I'll open up Cline and put Claude or Gemini Flash to work. This has a 90% success rate so far and has saved me hours of work.
EDIT: https://www.anthropic.com/engineering/claude-code-best-pract...
What’s the use case?
(I tried some things, and it blew up. Thus far my experience w agents in general)