I use Cursor and its tab completion; while what it can do is mind-blowing, in practice I'm not noticing a productivity boost.
I find that AI can help significantly with doing the plumbing, but it has no qualms about connecting the pipes wrong. I need to double- and triple-check the updated code - or fix the resulting errors when I don’t do that. So: boilerplate and outer app layers, yes; architecture and core libraries, no.
Curious, is that a property of all AI-assisted tools for now? Or would Copilot, perhaps with its new models, offer a different experience?
I'm actually very curious why AI use is such a bi-modal experience. I've used AI to move multi-thousand-line codebases between languages. I've created new apps from scratch with it.
My theory is the willingness to babysit and the modality. I'm perfectly fine telling the tool I use its errors and working side by side with it like it was another person. At the end of the day it can belt out lines of code faster than I, or any human, can, and I can review code very quickly, so the overall productivity boost has been great.
It does fundamentally alter my workflow. I'm very hands-off-keyboard when I'm working with AI, in a way that is much more like working with someone or coaching someone to make something instead of doing the making myself. Which I'm fine with, but I recognize many developers aren't.
I use AI autocomplete 0% of the time as I found that workflow was not as effective as me just writing code, but most of my most successful work using AI is a chat dialogue where I'm letting it build large swaths of the project a file or parts of a file at a time, with me reviewing and coaching.
As a programmer of over 20 years - this is terrifying.
I'm willing to accept that I just have "get off my lawn" syndrome or something.
But the idea of letting an LLM write/move large swaths of code seems so incredibly irresponsible. Whenever I sit down to write some code, be it a large implementation or a small function, I think about what other people (or future versions of myself) will struggle with when interacting with the code. Is it clear and concise? Is it too clever? Is it too easy to write a subtle bug when making changes? Have I made it totally clear that X is relying on Y dangerous behavior by adding a comment or intentionally making it visible in some other way?
It goes the other way too. If I know someone well (or their style) then it makes evaluating their code easier. The more time I spend in a codebase the better idea I have of what the writer was trying to do. I remember spending a lot of time reading the early Redis codebase and got a pretty good sense of how Salvatore thinks. Or altering my approaches to code reviews depending on which coworker was submitting it. These weren't things I were doing out of desire but because all non-trivial code has so much subtlety; it's just the nature of the beast.
So the thought of opening up a codebase that was cobbled together by an AI is just scary to me. Subtle bugs and errors would be equally distributed across the whole thing instead of concentrated where the writer was less competent (as is often the case). The whole thing just sounds like a gargantuan mess.
Change my mind.
My theory is grammatical correctness and specificity. I see a lot of people prompt like this:
"use python to write me a prog that does some dice rolls and makes a graph"
Vs
"Create a Python program that generates random numbers to simulate a series of dice rolls. Export a graph of the results in PNG format."
Information theory requires that you provide enough actual information. There is a minimum amount of work involved in supplying the input. Otherwise, the gaps get filled in with noise - maybe it works, maybe it's what you want, maybe not.
For example, maybe someday you could say "write me an OS" and it would work. However, to get exactly what you want, you still have to specify it. You can only compress so far.
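To illustrate the difference in specificity, here is roughly the kind of script a model could plausibly produce from the second prompt (a sketch only; actual output will vary by model):

    import random
    from collections import Counter

    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    def roll_dice(n_rolls=1000, sides=6):
        """Simulate n_rolls rolls of a fair die."""
        return [random.randint(1, sides) for _ in range(n_rolls)]

    rolls = roll_dice()
    counts = Counter(rolls)
    faces = sorted(counts)
    plt.bar(faces, [counts[f] for f in faces])
    plt.xlabel("Die face")
    plt.ylabel("Frequency")
    plt.title("Distribution of 1000 dice rolls")
    plt.savefig("dice_rolls.png")  # export the graph as PNG, as the prompt asked

The vague prompt leaves every one of those decisions (how many rolls, what kind of graph, what output format) to be filled in however the model likes.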
The most likely explanation is that the code you are writing has low information density and is stringing things together the same way many existing apps have already done.
That isn’t a judgement, but trying to use the AI code-completion tools for complex systems tasks is almost always a disaster.
I'm not sure how many people are like me, but my attempts to use Copilot have largely been in the context of writing code as usual, occasionally getting end-of-line or handful-of-lines completions from it.
I suspect there's probably a bigger shift needed, but I haven't seen anyone (besides AI "influencers" I don't trust..?) showing what their day-to-day workflows look like.
Is there a Vimcasts equivalent for learning the AI editor tips and tricks?
I agree with you, and it's confusing to me. I do think there is a lot of emotion at play here, rather than cold rationality.
Using LLM based tools effectively requires a change in workflow that a lot of people aren't ready to try. Everyone can share their anecdote of how an LLM has produced stupid or buggy code, but there is way too much focus on what we are now, rather than the direction of travel.
I think existing models are already sufficient; it's just that we need to improve the feedback loop. A lot of the corrections / direction I make to LLM-produced code could 100% be done by a better LLM agent. In the next year I can imagine tooling that:
- lets me interact fully via voice
- a separate "architecture" agent ensures that any produced code is in line with the patterns in a particular repo
- compile and runtime errors are automatically fed back in and automatically fixed (a rough sketch of such a loop follows after this comment)
- a refactoring workflow mode, where the aim is to first get tests written, then get the code working, and then get the code efficient, clean and with repo patterns
I'm excited by this direction of travel, but I do think it will fundamentally change software engineering in a way that is scary.
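To make the error-feedback idea above concrete, here is a minimal sketch of such a loop, assuming a hypothetical generate(prompt) wrapper around whatever model is in use; it only checks that a Python file compiles, but the same shape extends to test failures and runtime errors:

    import subprocess

    def fix_until_it_compiles(path, generate, max_rounds=5):
        """Repeatedly feed compile errors back to the model until the file compiles."""
        for _ in range(max_rounds):
            result = subprocess.run(
                ["python", "-m", "py_compile", path],
                capture_output=True, text=True,
            )
            if result.returncode == 0:
                return True  # compiles cleanly, stop iterating
            with open(path) as f:
                source = f.read()
            # generate() is a stand-in for a call to the model of your choice
            patched = generate(f"Fix this error:\n{result.stderr}\n\nCode:\n{source}")
            with open(path, "w") as f:
                f.write(patched)
        return False  # gave up after max_rounds attempts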
If you're doing something that appears in its training data a lot, like building a Twitter clone, then it is great. If you're using something brand new like React Router 7, then it makes mistakes.
I think it's bimodal because there's a roughly bimodal distribution of high-level attitudes among programmers. There's one clump that are willing to be humble and interact with the AI in a thoughtful, careful manner, acknowledging that it might be smarter than them (e.g. see Terry Tao's comments on using it for mathematics: to get good results he takes care with what he puts in - and imagine what "care" means for a professional mathematician!), and there's another clump who aren't.
> My theory is the willingness to babysit and the modality. I'm perfectly fine telling the tool I use its errors and working side by side with it like it was another person.
In my experience, babysitting the AI takes too much time and effort. I'd rather do it myself and use AI for tasks I don't have to babysit.
> I'm actually very curious why AI use is such a bi-modal experience. I've used AI to move multi-thousand-line codebases between languages. I've created new apps from scratch with it.
I think this depends on the nature of your work. I've been successful with using LLMs for creating things from scratch, for myself, especially in a domain I was not familiar with, and am quite happy with that. Things like proofs of concept or exploring a library or a framework. But in my current work setting, relying on LLMs to do production work is only somewhat helpful here and there, but nowhere near as helpful as in the first case. In some cases it hallucinated so close to what it was supposed to do that it introduced a bug I would never have created had I not used LLMs, and that took a lot of effort to spot.
I don't think that "creating new apps from scratch" should be the benchmark. Unless you're doing something very novel, creating a new app/service is rather formulaic. Many frameworks even have templates / generators for that sort of thing. LLMs are maybe just better generators - which is not useless, but it's not where the real complexity of software development lies.
The success stories I am looking for are things like "I migrated a Java 6 codebase with a legacy app server to Java 21", "I ripped out this unsupported library across the project and replaced it with a better one" or "I refactored the codebase so that the database access is in its own layer". If LLMs can do those tasks reliably, then I'll have another look.
> I'm actually very curious why AI use is such a bi-modal experience
I think it's just that it's better at some things than others. Lucky for people who happen to be working in Python/Node/PHP/Bash/SQL/Java; probably unlucky for people writing Go and Rust. (I'm hypothesising, because I don't know Go or Rust and have never used them, but when the AI doesn't know something it REALLY doesn't know it - it goes from being insanely useful to utterly useless.)
> I use AI autocomplete 0% of the time as I found that workflow was not as effective as me just writing code, but most of my most successful work using AI is a chat dialogue where I'm letting it build large swaths of the project a file or parts of a file at a time, with me reviewing and coaching.
Me too, the way I use it is more like pair programming.
YMMV, but for me at least I tend to use it for brainstorming - an initial sail through a subject/topic/task to get initial ideas. The idea is to use it as an admin who is guided by you through chatting. For example, I'm given a task to translate a user description/requirement into something that pulls data from the database - like (simplistic example) "what are the top-grossing films by category within each rating?" So I give the AI the database table schemas and literally the user requirement, see what it gives back, and compare it with how I'd do it. Then I ask it for more optimizations, what else can be done, etc., and keep chatting with the AI until I'm bored ;)
I'm curious, coming from the other end. I guess I can totally understand certain use cases being good - where I'm generating fairly simple, self-contained code in a language I'm unfamiliar with.
But surely you must have experienced something where you're literally fighting with the model, where it continuously repeats its mistakes, and fixing a mistake in one place breaks something else, and you can't seem to escape this loop. You then get desperate, invoking magic phrases like "you think through your problems step by step" or "you are a senior developer", only for it to lose the entire thread of the conversation.
Then the worst part is when you finally give up, your mental state of the problem is no better than when you first started off.
> I've used AI to move multi thousand line codebases between languages.
And are you certain you’ve reviewed all use cases to make sure no errors were introduced?
I recently tried using Google’s AI assistant for some basic things, like creating a function that could parse storage sizes in the format of 12KB or 34TB into an actual number of bytes. It confidently gave me amount, units = s.split(), which is just not correct. It even added a comment explaining what that line is meant to do.
This was an obvious case that just didn’t work. But imagine it had worked, yet flew into an infinite loop on an input like “12KB7” or some such.
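For contrast, a minimal sketch of the kind of function being described, assuming only the units B through TB are needed; an anchored regex means malformed input such as "12KB7" is rejected outright instead of being half-parsed or looped over:

    import re

    # binary (1024-based) multipliers; swap in powers of 1000 if you want decimal units
    _UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}

    def parse_size(s):
        """Parse strings like '12KB' or '34TB' into a number of bytes."""
        # fullmatch: digits, then a known unit, then end of string - no trailing junk allowed
        m = re.fullmatch(r"\s*(\d+)\s*([KMGT]?B)\s*", s, flags=re.IGNORECASE)
        if not m:
            raise ValueError(f"unrecognized size: {s!r}")
        amount, unit = m.groups()
        return int(amount) * _UNITS[unit.upper()]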
I'm convinced that what we are witnessing is that there are genius-level engineers (lots of them) who are, and many of whom have always been, sub-par communicators. I think being a good communicator tracks really well with how much someone can get out of LLMs (as does engineering competency - you need both).
Great but not genius engineers who are also great communicators may broadly outperform people with only technical genius soon, but that's speculation on my part.
> ...and I can review code very quickly so the overall productivity boost has been great.
Color me skeptical. After a certain point, greater speed is achieved by sacrificing accuracy and comprehension. So, "I can review code very quickly" starts to sound like "I don't read, I skim."
IMHO, reviewing code is one of the parts of the job that sucks, so I see "AI" as a wonderful technology to improve our lives by replacing fun with chores.
Exactly my style of working and how I think about it:
I've also not enabled/installed Copilot or similar, just the default auto-suggestions in VS.NET.
But I use LLMs heavily to get rid of all the exhausting tasks, and to generate ideas for what to improve in some larger code blocks so I don't have to rewrite/refactor them on my own. It boosts my productivity by 10x at least.
Interesting that you find the conversational approach effective. For me, I'd say 9 out of 10 code conversations get stuck in a loop, with me telling the AI the next suggested iteration didn't actually change anything or changed it back to something that was already broken. Do you not experience that so often, or do you have a way to escape it?
> I'm perfectly fine telling the tool I use its errors and working side by side with it like it was another person.
This is key. Traditional computing systems are deterministic machines, but AI is a probabilistic machine. So the way you interact, and the range, precision, and perspective of the output, stretch over a different problem/solution space.
I've tried using AI (Claude) to do refactors/move code between languages, and in my experience it has a tendency to go off the rails and just start making up code that does something similar, essentially doing a rewrite that never works.
I agree. I am in a very senior role and find that working with AI the same way you do I am many times more productive. Months of work becomes days or even hours of work
It's the subtle errors that are really difficult to navigate. I got burned for about 40 hours on a conditional being backward in the middle of an otherwise flawless method.
The apparent speed up is mostly a deception. It definitely helps with rough outlines and approaches. But, the faster you go, the less you will notice the fine details, and the more assumptions you will accumulate before realizing the fundamental error.
I'd rather find out I was wrong within the same day. I'd probably have written some unit tests and played around with that function a lot more if I had handcrafted it.
When I am able to ask an LLM a very simple question, and that saves me having to context-switch to answer the same simple question myself, it is a big time saver for me, though one that's hard to quantify.
Anything that reduces my cognitive load when the pressure is on is a blessing on some level.
That’s the thing, isn’t it? The craft of programming in the small is one of being intimate with the details, thinking things through conscientiously. LLMs don’t do that.
Exactly: one step forward, one step backward. Avoiding edge-case bugs is something that can't be glossed over, and for that I need to carefully review the code. Since I'm accountable for it and can't skip this part anyway, I'd rather review my own code than some chatbot's.
Why aren't you writing unit tests just because AI wrote the function? Unit tests should be written regardless of the skill of the developer. Ironically, unit tests are also one area where AI really does help move faster.
High-level design (rough outlines and approaches) is the worst place to use AI. The other place AI is pretty good is surfacing API or function calls you might not know about if you're new to the language. Basically, it can save you a lot of time by avoiding the need for tons of internet searching in some cases.
If it wants to complete what I wanted to type anyway, or something extremely similar, I just press tab, otherwise I type my own code.
I'd say about 70% of individual lines are obvious enough if you have the surrounding context that this works pretty well in practice. This number is somewhat lower in normal code and higher in unit tests.
Another use case is writing one-off scripts that aren't connected to any codebase in particular. If you're doing a lot of work with data, this comes in very handy.
Something like "here's the header of a CSV file", pass each row through model x, only pass these three fields, the model will give you annotations, put these back in the csv and save, show progress, save every n rows in case of crashes, when the output file exists, skip already processed rows."
I'm not (yet) convinced by AI writing entire features, I tried that a few times and it was very inconsistent with the surrounding codebase. Managing which parts of the codebase to put in its context is definitely an art though.
It's worth keeping in mind that this is the worst AI we'll ever have, so this will probably get better soon.
I don't close my eyes and do whatever it tells me to do. If I think I know better I don't "turn right at the next set of lights" I just drive on as I would have before GPS and eventually realise that I went the wrong way or the satnav realises there was a perfectly valid 2nd/3rd/4th path to get to where I wanted to go.
I haven't used Cursor, but I use Aider with Sonnet 3.5 and also use Copilot for "autocomplete".
I'd highly recommend reading through Aider's docs[0], because I think they're relevant for any AI tool you use. A lot of people harp on prompting, and while a good prompt is important, I often see developers making other mistakes, like not providing context that's good or correct, or providing too much of it[1].
When I find models are going on the wrong path with something, or "connecting the pipes wrong", I often add code comments that provide additional clarity. Not only does this help future me/devs, but the more I steer AI towards correct results, the fewer problems models seem to have going forward.
Everybody seems to be having wildly different experiences using AI for coding assistance, but I've personally found it to be a big productivity boost.
[0] https://aider.chat/docs/usage/tips.html
[1] https://aider.chat/docs/troubleshooting/edit-errors.html#red...
Totally agree that heavy commenting is the best convention for helping the assistant help you best. I try to comment in a way that makes a file or function into a "story" or kind of a single narrative.
I find chatgpt incredibly useful for writing scripts against well-known APIs, or for a "better stackoverflow". Things like "how do I use a cursor in sql" or "in a devops yaml pipeline, I want to trigger another pipeline. How do I do that?".
But working on our actual codebase with copilot in the IDE (Rider, in my case) is a net negative. It usually does OK when it's suggesting the completion of a single line, but when it decides to generate a whole block it invariably misunderstands the point of the code. I could imagine that getting better if I wrote more descriptive method names or comments, but the killer for me is that it just makes up methods and method signatures, even for objects that are part of publicly documented frameworks/APIs.
Same here. If you need to look up how to do something in an API, I find it much faster to use ChatGPT than to try to search through the janky official docs or some GitHub examples folder. ChatGPT is basically documentation search 2.0.
The problem with this is that you lose the entire context around the answer you're given. There might be notes about limitations or edge cases in the documentation or on StackOverflow. I already see this with my coworkers who use LLMs, they often have no idea how or why things work and then they're absolutely perplexed when something goes wrong.
I love your framing of it as a "better stackoverflow." That's so true. However, I feel like some of our complaints about accuracy and hidden bugs are temporary pain: 12 to 36 months before the tools truly become mind-blowing productivity multipliers.
I used Jetbrains' AI Assistant for a while and I made the mistake of trusting some of its code, which introduced bugs and slowed me down to double-check a lot of the boilerplate I had it write for me.
Really not worth it for me at this point.
I must add that I found Anthropic's Claude (Sonnet 3.5) quite useful when I had to work on a legacy code base using Adobe ColdFusion (*vomit). I knew nothing of ColdFusion or the awful code base, and it helped me figure out a lot of things about ColdFusion without having to spend too much energy on learning and generating code for a framework I will never use again. I still had to make some updates to the code, but it was less cognitive effort than having to read docs and spend time Googling.
> How can this be possible if you literally admit its tab completion is mindblowing?
I might suggest that coding doesn't take as much of our time as we might think it does.
Hypothetically:
Suppose coding takes 20% of your total clock time. If you improve your coding efficiency by 10%, you've only improved your total job efficiency by 2%. This is great, but probably not the mind-blowing gain that's hyped by the AI boom.
(I used 20% as a sample here, but it's not far away from my anecdotal experience, where so much of my time is spent in spec gathering, communication, meeting security/compliance standards, etc).
> How can this be possible if you literally admit its tab completion is mindblowing?
What about it makes it impossible? I’m impressed by what AI assistants can do - and in practice it doesn’t help me personally.
> Select line of code, prompt it to refactor, verify they are good, accept the changes.
It’s the “verify” part that I find tricky. Do it too fast and you spend more time debugging than you originally gained. Do it too slow and you don’t gain much time.
There is a whole category of bugs that I’m unlikely to write myself but I’m likely to overlook when reading code. Mixing up variable types, mixing up variables with similar names, misusing functions I’m unfamiliar with and more.
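As a hypothetical example of that category, here is the kind of slip that is easy to generate and easy to skim past in review - two similarly named variables, one of which is quietly never used:

    def apply_discount(price_cents, discount_percent):
        """Return the discounted price in cents."""
        discounted = price_cents * (1 - discount_percent / 100)
        # easy to miss when skimming: returns the original value, not `discounted`
        return int(price_cents)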
I rarely use the tab completion. Instead I use the chat and manually select files I know should be in context. I am barely writing any code myself anymore.
Just sanity checking that the output and “piping” is correct.
My productivity (in frontend work at least) is significantly higher than before.
Out of curiosity, how long have you been working as a developer? Just that, in my experience, this is mostly true for juniors and mids (depending on the company, language, product etc. etc.). For example, I often find that copilot will hallucinate tailwind classes that don't exist in our design system library, or make simple logical errors when building charts (sometimes incorrect ranges, rarely hallucinated fields) and as soon as I start bringing in 3rd party services or poorly named legacy APIs all hope is lost and I'm better off going it alone with an LSP and a prayer.
Similar experience. I used GitLab Duo and now the JetBrains AI Assistant (the small one which does only inline).
What I notice is that line completion is quite good if you read the produced code as prose. But most of the time the internals are different, and so the completed line is still useless.
E.g. assume an enum Completion with the values InlineCompletion, FullLine, NoCompletion.
If I write
> if (currentMode == Compl
It will happily suggest
> if (currentMode == Completion.FullLineCompletion) {
while not realizing that this enum value does not exist.
AI will have the effect of shifting development effort from authorship to verification. As you note, we've come a long way towards making the writing of the code practically free, but we're going to need to beef up our tools for understanding code already written. I think we've only scratched the surface of AI-assisted program analysis.
> SpaceX's advancements are impressive, from rocket blow up to successfully catching the Starship booster.
That felt like it was LLM-generated, since it doesn't have anything to do with the subject being discussed. Not only is it a different industry, it's a completely different set of problems. We know what's involved in catching a rocket. It's a massive engineering challenge, yes, but we all know it can be done (whether or not it makes sense or is economically viable are different issues).
Even going to the Moon – which was a massive project and took massive focus from an entire country to do – was a matter of developing the equipment, procedures, calculations (and yes, some software). We knew back then it could be done, and roughly how.
Artificial intelligence? We don't know enough about "intelligence". There isn't even a target to reach right now. If we said "resources aren't a problem, let's build AI", there isn't a single person on this planet that can tell you how to build such an AI or even which technologies need to be developed.
More to the point, current LLMs are able to probabilistically generate data based on prompts. That's pretty much it. They don't "know" anything about what they are generating, they can't reason about it. In order for "AI" to replace developers entirely, we need other big advancements in the field, which may or may not come.
Except Cursor is the fireworks based on black powder here. It will look good, but as a technology to get you to the moon it looks like a dead end. NOTHING (of serious science) seems to indicate LLMs being anything but a dead end with the current hardware capabilities.
So then I ask: What, in qualitative terms, makes you think AI in the current form will be capable of this in 5 or 10 years? Other than seeing the middle of what seems to be an S-curve and going «ooooh shiny exponential!»
One of the reasons for that may be the price: large code changes with multi turn conversation can eat up a lot of tokens, while those tools charge you a flat price per month. Probably many hacks are done under the hood to keep *their* costs low, and the user experiences this as lower quality responses.
Still the "architecture and core libraries" is rather corner case, something at the bottom of their current sales funnel.
also: do you really want to get equivalent of 1 FTE work for 20 USD per month?:)
If this is how you use Cursor then you don't need Cursor. Autocomplete existed even before AI, but Cursor's selling point is its multi-file editing and a sensible workflow that lets users iterate through diffs in the UI.
I have never used AI to generate code, only to edit it. I can see it being useful for both, and we should look at all use cases of AI instead of looking at it as glorified autocomplete.
I've found explaining boilerplate code to Claude or ChatGPT has been very worthwhile, the boring code gets written and the prompt is the documentation.
OTOH tab completion in Intellij and Xcode has been useful occasionally, usually distracting and sometimes annoying. A way to fast toggle this would be good, but when I know what I want to code, good old code completion works nicely thanks.
> boilerplate and outer app layers, yes; architecture and core libraries, no
That's what you want though, isn't it? Especially the boilerplate bit. Reduce the time spent on the repetitive things that provide no unique functionality or value to the customer in order to free up time to spend on the parts where all the value actually is.
That's my exact experience with GitHub Copilot. Even at boilerplate stuff it sucks. I have no idea why its autocomplete is so bad when it has access to my code, the function signatures, types, etc. It gets stuff wrong all the time. For example, it will just flat out suggest functions that don't exist, neither in the Python core libraries nor in my own modules. It doesn't make sense.
I have all but given up on using Copilot for code development. I still use it for autocomplete and boilerplate stuff, but I still have to review that, so there's quite a bit of overhead, as it introduces subtle errors, especially in languages like Python. Beyond that, its failure rate at producing running, correct code is basically 100%.
I'd love an autoselected LLM that is fine-tuned to the syntax I'm actively using -- Cursor has a bit of a head start, but where Github and others can take it could be mindblowing (Cursor's moat is a decent VS Code extension -- I'm not sure it's a deep moat though).
I haven't used it, but I've seen videos where people used Cline with Claude to modify the source and afterwards have Cline run the app, collect the errors, and fix them.
The examples were small and trivial, though. I am not sure how that would work in large and complex code bases.
I had the same experience with Copilot and stopped using it. Been somewhat tempted to try Cursor thinking maybe it's gotten better but this comment is suggesting that maybe it's not.
I get the best use out of straight up ChatGPT. It's like a cheatsheet for everything.
Honestly, I hate it. I find myself banging my head on the table because of how many endless cycles it goes through without giving me the solution, haha. I've probably started projects over multiple times because the code it generated was so bad, lol.
I use copilot to write boilerplate code that I know how to write but I don't feel like writing it. When it gets it wrong, it's easy to tell and easy to fix.
Tab complete is just one of their proprietary models. I find chat-mode more helpful for refactoring and multi-file updates, even more when I specify the exact files to include.
In what way exactly? I wish Zed was better here but it's at about the same level as the current nvim plugins avail. As far as I can tell it's just a UI for adding context.
With React, the guesses it makes around what props / types I want where, especially moving from file to file, are worth the price of admission. Everything else it does is icing on the cake. The new import suggestion is much quicker than the TypeScript compiler, lmao. And it's always the right one, instead of suggesting ones that aren't relevant.
Composer can be hit or miss, but I've found it really good at game programming.
I'm building a tool in this space and believe it's actually multiple separate problems. From most to least solvable:
1. AI coding tools benefit a lot from explicit instructions/specifications and context for how their output will be used. This is actually a very similar problem to when eg someone asks a programmer "build me a website to do X" and then being unhappy with the result because they actually wanted to do "something like X", and a payments portal, and yellow buttons, and to host it on their existing website. So models need to be given those particular instructions somehow (there are many ways to do it, I think my approach is one of the best so far) and context (eg RAG via find-references, other files in your codebase, etc)
2. AI makes coding errors, bad assumptions, and mistakes just like humans. It's rather difficult to implement auto-correction in a good way, and goes beyond mere code-writing into "agentic" territory. This is also what I'm working on.
3. AI tools don't have architecture/software/system design knowledge appropriately represented in their training data and all the other techniques used to refine the model before releasing it. More accurately, they might have knowledge in the form of, e.g., all the blog posts and docs out there about it, but not skill. Actually, there is some improvement here, because I think o1 and 3.5 Sonnet are doing some kind of reinforcement learning / self-training to get better at this. But it's not easily addressable on your end.
4. There is ultimately a ton of context cached in your brain that you cannot realistically share with the AI model, either because it's not written anywhere or there is just too much of it. For example, you may want to structure your code in a certain way because your next feature will extend it or use it. Or your product is hosted on serving platform Y which has an implementation detail where it tries automatically setting Content-Type response headers by appending them to existing headers, so manually setting Content-Type in the response causes bugs on certain clients. You can't magically stuff all of this into the model context.
My product tries to address all of these to varying extents. The largest gains in coding come from making it easier to specify requirements and self-correct, but architecture/design are much harder and not something we're working on much. You or anybody else can feel free to email me if you're interested in meeting for a product demo/feedback session - so far people really like our approach to setting output specs.
This is pretty exciting. I'm a copilot user at work, but also have access to Claude. I'm more inclined to use Claude for difficult coding problems or to review my work as I've just grown more confident in its abilities over the last several months.
I use both Claude and ChatGPT/GPT-4o a lot. Claude, the model, definitely is 'better' than GPT-4o. But OpenAI provides a much more capable app in ChatGPT and an easier development platform.
I would absolutely choose to use Claude as my model with ChatGPT if that happened (yes, I know it won't). ChatGPT as an app is just so far ahead: code interpreter, web search/fetch, fluid voice interaction, Custom GPTs, image generation, and memory. It isn't close. But Claude absolutely produces better code, only being beaten by ChatGPT because it can fetch data from the web to RAG enhance its knowledge of things like APIs.
Claude's implementation of artifacts is very good though, and I'm sure that is what led OpenAI to push out their buggy canvas feature.
It’s all a dice game with these things, you have to watch them closely or they start running you (with bad outcomes). Disclaimers aside:
Sonnet is better in the small, by a lot. It’s sharply up from idk, three months ago or something when it was still an attractive nuisance. It still tops out at “Best SO Answer”, but it hits that like 90%+. If it involves more than copy paste, sorry folks, it’s still just really fucking good copy paste.
But for sheer “doesn’t stutter every interaction at the worst moment”? You’ve got to hand it to the ops people: 4o can give you second best in industrial quantity on demand. I’m finding that if AI is good enough, then OpenAI is good enough.
Are there any good 3rd-party native frontend apps for Claude (on MacOS)? I mean something like ChatGPTs app, not an editor. I guess one option would be to just run Claude iPad app on MacOS.
FWIW, I was able to get a decent way into making my own client for ChatGPT by asking the free 3.5 version to do JS for me* before it was made redundant by the real app, so this shouldn't be too hard if you want a specific experience/workflow?
* I'm iOS by experience; my main professional JS experience was something like a year before jQuery came out, so I kinda need an LLM to catch me up for anything HTML
> ChatGPT as an app is just so far ahead: code interpreter, web search/fetch, fluid voice interaction, Custom GPTs, image generation, and memory. It isn't close.
Funny thing, TypingMind was ahead of them for over a year, implementing those features on top of the API, without trying to mix business model with engineering[0]. It's only recently that ChatGPT webapp got more polished and streamlined, but TypingMind's been giving you all those features for every LLM that can handle it. So, if you're looking for ChatGPT-level frontend to Anthropic models, this is it.
ChatGPT shines on mobile[1] and I still keep my subscription for that reason. On desktop, I stick to TypingMind and being able to run the same plugins on GPT-4o and Claude 3.5 Sonnet, and if I need a new tool, I can make myself one in five minutes with passing knowledge of JavaScript[2]; no need to subscribe to some Gee Pee Tee.
Now, I know I sound like a shill; I'm not. I'm just a satisfied user with no affiliation to the app or the guy that made it. It's just that TypingMind did the stupidly obvious thing to do with the API and tool support (even before the latter was released), and continues to do the obvious things with it, and I'm completely confused as to why others don't, or why people find "GPTs" novel. They're not. They're a simple idea, wrapped in tons of marketing bullshit that makes it less useful and delayed its release by half a year.
--
[0] - "GPTs", seriously. That's not a feature, that's just system prompt and model config, put in an opaque box and distributed on a marketplace for no good reason.
[1] - Voice story has been better for a while, but that's a matter of integration - OpenAI putting together their own LLM and (unreleased) voice model in a mobile app, in a manner hardly possible with the API they offered, vs. TypingMind being a webapp that uses third-party TTS and STT models via a "bring your own API key" approach.
Have you tried using Cursor with Claude embedded? I can't go back to anything else; it's very nice having the AI embedded in the IDE, and it just knows all the files I am working with. Cursor can use GPT-4o too if you want.
I too use Claude more frequently than OpenAI's GPT-4o. I think this is a twofold move for MS and I like it. Claude being more accurate / efficient for me says it's likely they see the same thing, win number 1. The second is with all the OpenAI drama MS has started to distance themselves over a souring relationship (allegedly). If so, this could be a smart move away tactfully.
Either way, Claude is great so this is a net win for everyone.
I'm the same, but I had a lot of issues getting structured output from Anthropic. I ended up always writing response processors. Frustrated by how fragile that was, I decided to try OpenAI's structured outputs and it just worked, and since they also have prompt caching now, it worked out very well for my use case.
Anthropic seems to have addressed the issue using pydantic, but I haven't had a chance to test it yet.
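For anyone who has not tried it, this is roughly what the OpenAI structured-outputs flow looks like in the Python SDK at the time of writing; the Invoice schema is made up, and the exact method names may shift, so check the current docs:

    from openai import OpenAI
    from pydantic import BaseModel

    class Invoice(BaseModel):  # hypothetical schema for the data you want back
        vendor: str
        total_cents: int

    client = OpenAI()
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": "Acme Corp billed us $12.50."}],
        response_format=Invoice,  # the SDK constrains the output to this schema
    )
    invoice = completion.choices[0].message.parsed  # an Invoice instance, not raw JSON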
>The second is with all the OpenAI drama MS has started to distance themselves over a souring relationship (allegedly). If so, this could be a smart move away tactfully.
I agree, this was a tactical move designed to give them leverage over OpenAI.
The speed with which AI models are improving blows my mind. Humans quickly normalize technological progress, but it's staggering to reflect on our progress over just these two years.
Yes! I'm much more inclined to write one-off scripts for short manual tasks, as I can usually get AI to produce something useful very fast. For example, last week I worked with Claude to write a script to get a sense of how many PRs my company had that included comprehensive testing. Previously this was borderline best done as a manual task; now I just ask Claude to write a short bash script that uses the GitHub CLI to do it, and I've got a repeatable, reliable process for pulling this information.
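As a sketch of the kind of throwaway script involved (in Python rather than bash, to match the other examples here; the "mentions tests" heuristic and the limit of 200 PRs are placeholders):

    import json
    import subprocess

    def merged_prs(limit=200):
        """Fetch recently merged PRs via the GitHub CLI."""
        out = subprocess.run(
            ["gh", "pr", "list", "--state", "merged", "--limit", str(limit),
             "--json", "number,title,body"],
            capture_output=True, text=True, check=True,
        ).stdout
        return json.loads(out)

    def mentions_tests(pr):
        text = (pr["title"] + " " + (pr.get("body") or "")).lower()
        return "test" in text  # crude stand-in for "includes comprehensive testing"

    prs = merged_prs()
    with_tests = sum(mentions_tests(pr) for pr in prs)
    print(f"{with_tests}/{len(prs)} recently merged PRs mention tests")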
Lots of progress, but I feel like we've been seeing diminishing returns. I can't help but feel like recent improvements are just refinements and not real advances. The interest in AI may drive investment and research in better models that are game-changers, but we aren't there yet.
I wonder how long people will still protest in these threads that "It doesn't know anything! It's just an autocomplete parrot!"
Because.. yea, it is. However.. it keeps expanding, it keeps getting more useful. Yea people and especially companies are using it for things which it has no business being involved in.. and despite that it keeps growing, it keeps progressing.
I do find the "stochastic parrot" comments slowly dwindle in number and volume with each significant release, though.
Still, i find it weirdly interesting to see a bunch of people be both right and "wrong" at the same time. They're completely right, and yet it's like they're also being proven wrong in the ways that matter.
One service is not really enough -- you need a few to triangulate more often than not, especially when it comes to code using the latest versions of public APIs.
Phind is useful, as you can switch between them -- but you only get a handful of o1 and Opus queries a day, which I burn through quickly at the moment on deeper things. Phind-405b and 3.5 Sonnet are decent for general use.
I wonder what the rationale for this was internally. More OpenAI issues? competitiveness with Cursor? It seems good for the user to increase competition across LLM providers.
Also, the title is ambiguous. I thought GitHub had canceled deals they had in the works. The article is clearly about making a deal, but it's unclear from the article's title.
Could be a fight against Llama, which excludes MS and Google in its open license (though I think Meta has done separate paid deals with one or both of them). Meta are notably absent from this announcement.
I was at the keynote, Llama was featured in the Copilot models section and called out specifically, as was Mistral.
I assume they just aren't at the point where they have the ability or want to host the compute to offer up Llama as an option as opposed to OpenAI, Anthropic and Google who are all offering the model as a service.
I just tried out enabling access to Claude 3.5 in VS Code in every place I could find. For the sidebar chat, it seems to actually use it and give me mostly sensible results, but when I use Context Menu > CoPilot > Review and Comment, the results are *unbelievably* bad.
Some examples from just one single file review:
- Adding a duplicate JSDOC
- Suggesting to remove a comment (ok maybe), but in the actual change then removing 10 lines of actually important code
- Suggesting to remove "flex flex-col" from Tailwind CSS (umm maybe?), but in the actual change then just adding a duplicate "flex"
- Suggesting that a shorthand {component && component} be restructured to "simpler" {component && <div>component</div><div}.. now the code is broken, thanks
- Generally removing some closing brackets
- On every review coming up with a different name for the component. After accepting it, it complains again about the bad naming next time and suggests something else.
Is this just my experience? This seems worse than Claude 3.5 or even GPT-4. What model powers this functionality?
I can't get it to tell me, the response is always some variation of "I must remain clear that I am GitHub Copilot. I cannot and should not confirm being Claude 3.5 or any other model, regardless of UI settings. This is part of maintaining accurate and transparent communication."
I’ve been using Cody from Sourcegraph to have access to other models; if Copilot offers something similar I guess I will switch back to it. I find Copilot autocomplete to be more often on point than Cody, but the chat experience with Cody + Sonnet 3.5 is way ahead in my experience.
Context is a huge part of the chat experience in Cody, and we're working hard to stay ahead there as well with things like OpenCtx (https://openctx.org) and more code context based on the code graph (defs/refs/etc.). All this competition is good for everyone. :)
I'm a hypocrite because I'm now currently paying for Cody due to their integration with the new OpenAI o1-preview model. I find this model to be mind blowing and it's made me actually focus on the more mundane tasks that come with the job.
I still think it’s worth emphasising - LLMs represent a massive capital absorber. Taking gobs of funding into your company is how you grow, how your options become more valuable, how your employees stay with you. If that treadmill were to break, bad things happen.
Search has been stuttering for a while - Google’s growth and investment have been flattening - at some point they absorbed all the world's stored information.
OpenAI showed the new growth: we need billions of dollars to build and then run the LLMs (at a loss, one assumes), so the treadmill can keep going.
"use python to write me a prog that does some dice rolls and makes a graph"
Vs
"Create a Python program that generates random numbers to simulate a series of dice rolls. Export a graph of the results in PNG format."
Information theory requires that you provide enough actual information. There is a minimum amount of work to supply the input. Otherwise, the gaps will get filled in with noise, working, what you want, or not.
For example, maybe someday you could say "write me an OS" and it would work. However, to get exactly what you want, you still have to specify it. You can only compress so far.
That isn’t a judgement but trying to use the ai code completion tools for complex systems tasks is almost always a disaster.
Is there a Vimcasts equivalent for learning the AI editor tips and tricks?
Using LLM based tools effectively requires a change in workflow that a lot of people aren't ready to try. Everyone can share their anecdote of how an LLM has produced stupid or buggy code, but there is way too much focus on what we are now, rather than the direction of travel.
I think existing models are already sufficient, its just we need to improve the feedback loop. A lot of the corrections / direction I make to LLM produced code could 100% be done by a better LLM agent. In the next year I can imagine tooling that: - lets me interact fully via voice - a separate "architecture" agent ensures that any produced code is in line with the patterns in a particular repo - compile and runtime errors are automatically fed back in and automatically fixed - a refactoring workflow mode, where the aim is to first get tests written, then get the code working, and then get the code efficient, clean and with repo patterns
I'm excited by this direction of travel, but I do think it will fundamentally change software engineering in a way that is scary.
In my experience, baby sitting the AI takes to much time and effort. I'd rather do it myself and use AI for tasks I don't have to babysit.
I think this depends on the nature of your work. I've been successful with using LLMS for creating things from scratch, for myself, especially in a domain I was not familiar with and am quite happy with that. Things like proof of concepts or exploring a library or a framework. But in my current work setting, relying on LLMS to do production work is only somewhat helpful here and there but nowhere near as helpful as in the first case. In some cases it hallucinated so close to what it was supposed to do that it introduced a bug I would have never created had I not used LLMs and that took a lot of effort to spot.
I don't think that "creating new apps from scratch" should be the benchmark. Unless you're doing something very novel, creating a new app/service is rather formulaic. Many frameworks even have templates / generators for that sort of thing. LLMs are maybe just better generators - which is not useless, but it's not where the real complexity of software development lies.
The success stories I am looking for are things like "I migrated a Java 6 codebase with a legacy app server to Java 21", "I ripped out this unsupported library across the project and replaced it with a better one" or "I refactored the codebase so that the database access is in its own layer". If LLMs can do those tasks reliably, then I'll have another look.
I think it's just that it's better at some things than others. Lucky for people who happen to be working in python/node/php/bash/sql/java probably unlucky for people writing Go and Rust (I'm hypothesising because I don't know Go or Rust nor have I ever used them but when the AI doesn't know something it REALLY doesn't know it, like it goes from being insanely useful to utterly useless).
> I use AI autocomplete 0% of the time as I found that workflow was not as effective as me just writing code, but most of my most successful work using AI is a chat dialogue where I'm letting it build large swaths of the project a file or parts of a file at a time, with me reviewing and coaching.
Me too, the way I use it is more like pair programming.
But surely you must have experienced something where you're literally fighting with the model, where it continuously repeats its mistakes, and fixing a mistake in one place, breaks something else, and you can't seem to escape this loop. You then get desperate, invoking magic phrases like "you think through your problems step by step", or "you are a senior developer", only for it to loose the entire thread of the conversation.
Then the worst part is when you finally give up, your mental state of the problem is no better than when you first started off.
And are you certain you’ve reviewed all use cases to make sure no errors were introduced?
I recently tried using Google’s AI assistant for some basic things like creating a function that could parse storage size in the format of 12KB, or 34TB into an actual number of bytes. It confidently gave me amount, units = s.split() which just is not correct. Even added a comment explaining what that line is meant to do.
This was an obvious case that just didn’t work. But imagine it did work but flew into an infinite loop on an input like “12KB7” or some such.
Great but not genius engineers who are also great communicators may broadly outperform people with only technical genius soon, but that's speculation on my part.
Color me skeptical. After a certain point, greater speed is achieved by sacrificing accuracy and comprehension. So, "I can review code very quickly" starts to sound like "I don't read, I skim."
IMHO, reviewing code is one of the parts of the job that sucks, so I see "AI" as a wonderful technology to improve our lives by replacing fun with chores.
THIS!
exactly my style of working and how i think about that: i've also not enabled/installed CoPilot or similar, just AutoSuggestion by default in VS.NET But i use LLM heavily to get rid off all the exhausting tasks, and to generate ideas what to improve in some larger code blocks so i dont have to rewrite/refactor it on my own.
It boosts my productivity by 10x at least.
I am more curious about why someone would do this.
Would you mind sharing those apps to view? Not the code, just the apps. I have a suspicion about the bi-modal experience.
My conspiracy theory is that the positive experiences are exaggerated and come from investors in the Nvidia stock.
I don't know where programmers are learning this idea that writing code fast is the ideal. If it takes 30 years to write one line of code, then it takes 30 years. Ideally, it takes 30 years to write zero lines of code.
For everything clever, it kind of needs babysitting, which doesn't make it faster than writing the code myself. Even for the glorified use case of writing unit tests, it sucks unless the code is very simple with few dependencies.
I have asked hard questions on SO that only a human with experience could answer. And I have found answers on SO that only a human with experience could have written.
It can be used if you want the reliability of a random forum poster. Which... sure, knock yourself out. Sometimes there are gems in that dirt.
I'm getting _very_ bearish on using LLMs for things that aren't pattern recognition.
Really not worth it for me at this point.
I must add that I found Anthropic's Claude (Sonnet 3.5) quite useful when I had to work on a legacy code base using Adobe ColdFusion (*vomit). I knew nothing of ColdFusion or the awful code base, and it helped me figure out a lot about ColdFusion without having to spend too much energy learning and writing code for a framework I will never use again. I still had to make some updates to the code, but it was less cognitive effort than having to read docs and spend time Googling.
How can this be possible if you literally admit its tab completion is mindblowing?
Isn't really good tab completion good enough for at least a 5% productivity boost? 10%? 20%?
Select a line of code, prompt it to refactor, verify the changes are good, accept them.
I might suggest that coding doesn't take as much of our time as we might think it does.
Hypothetically:
Suppose coding takes 20% of your total clock time. If you improve your coding efficiency by 10%, you've only improved your total job efficiency by 2%. This is great, but probably not the mind-blowing gain that's hyped by the AI boom.
(I used 20% as a sample here, but it's not far away from my anecdotal experience, where so much of my time is spent in spec gathering, communication, meeting security/compliance standards, etc).
What about it makes it impossible? I’m impressed by what AI assistants can do - and in practice it doesn’t help me personally.
> Select a line of code, prompt it to refactor, verify the changes are good, accept them.
It’s the “verify” part that I find tricky. Do it too fast and you spend more time debugging than you originally gained. Do it too slow and you don’t gain much time.
There is a whole category of bugs that I’m unlikely to write myself but I’m likely to overlook when reading code. Mixing up variable types, mixing up variables with similar names, misusing functions I’m unfamiliar with and more.
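A contrived Python example of the kind of thing I mean; nothing crashes and it reads fine at a glance, but the wrong variable gets used (the function and names are made up):

    def apply_discount(subtotal: float, discount_rate: float, tax_rate: float) -> float:
        discounted = subtotal * (1 - discount_rate)
        taxed = discounted * (1 + tax_rate)
        # Bug: returns the pre-tax value; "discounted" and "taxed" are similar
        # enough that a quick review of generated code sails right past it.
        return round(discounted, 2)  # should be round(taxed, 2)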
If I had a knife of perfect sharpness which never dulled, that would be mind-blowing. It also would very likely not make me a better cook.
Just sanity checking that the output and “piping” is correct.
My productivity (in frontend work at least) is significantly higher than before.
What I notice is that line completion is quite good if you only read the produced code as prose. But most of the time the internals are different, and so the completed line is still useless.
E.g. assume an enum Completion with the values InlineCompletion, FullLine and NoCompletion.
If I write
> if (currentMode == Compl
It will happily suggest
> if (currentMode == Completion.FullLineCompletion) {
while not realizing that this enum value does not exist.
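Sketching the same failure in Python for concreteness (names made up): the suggested member simply isn't defined, so nothing flags it until that line actually runs.

    from enum import Enum, auto

    class Completion(Enum):
        InlineCompletion = auto()
        FullLine = auto()
        NoCompletion = auto()

    current_mode = Completion.FullLine

    try:
        # The kind of line tab completion likes to suggest: a plausible-looking
        # member name that was never defined on the enum.
        if current_mode == Completion.FullLineCompletion:
            pass
    except AttributeError as err:
        print(err)  # the hallucinated member doesn't exist, so this fails at runtime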
SpaceX's advancements are impressive, from rocket blow up to successfully catching the Starship booster.
Who knows what AI will be capable of in 5-10 years? Perhaps it will revolutionize code assistance or even replace developers.
That felt like it was LLM-generated, since it doesn't have anything to do with the subject being discussed. Not only is it a different industry, it's a completely different set of problems. We know what's involved in catching a rocket. It's a massive engineering challenge, yes, but we all know it can be done (whether or not it makes sense or is economically viable are different issues).
Even going to the Moon – which was a massive project and took massive focus from an entire country to do – was a matter of developing the equipment, procedures, calculations (and yes, some software). We knew back then it could be done, and roughly how.
Artificial intelligence? We don't know enough about "intelligence". There isn't even a target to reach right now. If we said "resources aren't a problem, let's build AI", there isn't a single person on this planet that can tell you how to build such an AI or even which technologies need to be developed.
More to the point, current LLMs are able to probabilistically generate data based on prompts. That's pretty much it. They don't "know" anything about what they are generating, they can't reason about it. In order for "AI" to replace developers entirely, we need other big advancements in the field, which may or may not come.
So then I ask: What, in qualitative terms, makes you think AI in its current form will be capable of this in 5 or 10 years? Other than seeing the middle of what seems to be an S-curve and going «ooooh shiny exponential!»
I don't think there will be a "replace developers, but other work remains extant" moment - at least not one that lasts for very long.
Still, the "architecture and core libraries" part is rather a corner case, something at the bottom of their current sales funnel.
Also: do you really want to get the equivalent of 1 FTE's work for 20 USD per month? :)
If this is how you use Cursor then you don't need Cursor. Autocomplete existed before AI; Cursor's selling point is multi-file editing and a sensible workflow that lets users iterate through diffs in the UI.
I have never used AI to generate code, only to edit it. I can see it being useful for both, and we should look at all the use cases of AI instead of treating it as glorified autocomplete.
I tried Cursor yesterday, only to realise it can't run and debug C# projects.
OTOH, tab completion in IntelliJ and Xcode has been occasionally useful, usually distracting and sometimes annoying. A way to toggle it quickly would be good, but when I know what I want to code, good old code completion works nicely, thanks.
- Mostly writing React
- Not using any obscure or new libraries
- Naming things well
- Keeping logic simple
- Leaving a comment at the point where I'm about to make a shift from what the common logic would be
- Getting a feel for when it's going to be able to correctly guess or not (and not even reading it if I think it's going to be wrong)
- Trusting short blocks more than long ones
That's what you want though, isn't it? Especially the boilerplate bit. Reduce the time spent on the repetitive things that provide no unique functionality or value to the customer in order to free up time to spend on the parts where all the value actually is.
I have all but given up on using Copilot for code development. I still use it for autocomplete and boilerplate stuff, but I still have to review that, so there's quite a bit of overhead, as it introduces subtle errors, especially in languages like Python. Beyond that, its failure rate at producing running, correct code is basically 100%.
The examples were small and trivial, though. I am not sure how that would work in large and complex code bases.
I get the best use out of straight up ChatGPT. It's like a cheatsheet for everything.
Well, one of the most important skills in the AI-generated-code era is the ability to read code, and quickly.
Another thing is, writing smaller functions helps.
I am. Can suddenly do in a weekend what would have taken a week.
Composer can be hit or miss, but I've found it really good at game programming.
1. AI coding tools benefit a lot from explicit instructions/specifications, and from context about how their output will be used. This is actually very similar to when, e.g., someone asks a programmer to "build me a website to do X" and is then unhappy with the result, because they actually wanted "something like X", plus a payments portal, yellow buttons, and hosting on their existing website. So models need to be given those particular instructions somehow (there are many ways to do it; I think my approach is one of the best so far) and context (e.g. RAG via find-references, other files in your codebase, etc.).
2. AI makes coding errors, bad assumptions, and mistakes just like humans. It's rather difficult to implement auto-correction in a good way, and goes beyond mere code-writing into "agentic" territory. This is also what I'm working on.
3. AI tools don't have architecture/software/system design knowledge appropriately represented in their training data or in the other techniques used to refine the model before release. More accurately, they might have knowledge in the form of, e.g., all the blog posts and docs out there about it, but not skill. Actually, there is some improvement here, because I think o1 and 3.5 Sonnet are doing some kind of reinforcement learning/self-training to get better at this. But it's not easily addressable on your end.
4. There is ultimately a ton of context cached in your brain that you cannot realistically share with the AI model, either because it's not written anywhere or there is just too much of it. For example, you may want to structure your code in a certain way because your next feature will extend it or use it. Or your product is hosted on serving platform Y which has an implementation detail where it tries automatically setting Content-Type response headers by appending them to existing headers, so manually setting Content-Type in the response causes bugs on certain clients. You can't magically stuff all of this into the model context.
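To make point 4 concrete, here's a toy sketch of the header behavior described above; the platform and its append-instead-of-replace merge are hypothetical, exactly as in the example:

    # Toy model of a platform that appends its own Content-Type guess instead of
    # checking whether the application already set one (hypothetical behavior).
    def platform_merge_headers(app_headers):
        return app_headers + [("Content-Type", "text/html; charset=utf-8")]

    app_headers = [("Content-Type", "application/json")]  # set manually in app code
    print(platform_merge_headers(app_headers))
    # [('Content-Type', 'application/json'), ('Content-Type', 'text/html; charset=utf-8')]
    # Two conflicting Content-Type headers; clients disagree on which one wins,
    # and that's exactly the kind of context the model has no way of knowing.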
My product tries to address all of these to varying extents. The largest gains in coding come from making it easier to specify requirements and self-correct, but architecture/design are much harder and not something we're working on much. You or anybody else can feel free to email me if you're interested in meeting for a product demo/feedback session - so far people really like our approach to setting output specs.
I would absolutely choose to use Claude as my model with ChatGPT if that happened (yes, I know it won't). ChatGPT as an app is just so far ahead: code interpreter, web search/fetch, fluid voice interaction, Custom GPTs, image generation, and memory. It isn't close. But Claude absolutely produces better code, only being beaten by ChatGPT because it can fetch data from the web to RAG enhance its knowledge of things like APIs.
Claude's implementation of artifacts is very good though, and I'm sure that is what led OpenAI to push out their buggy canvas feature.
Sonnet is better in the small, by a lot. It's sharply up from, idk, three months ago or something, when it was still an attractive nuisance. It still tops out at "best SO answer", but it hits that like 90%+ of the time. If it involves more than copy-paste, sorry folks, it's still just really fucking good copy-paste.
But for sheer “doesn’t stutter every interaction at the worst moment”? You’ve got to hand it to the ops people: 4o can give you second best in industrial quantity on demand. I’m finding that if AI is good enough, then OpenAI is good enough.
Which app are you talking about here?
* I'm iOS by experience; my main professional JS experience was something like a year before jQuery came out, so I kinda need an LLM to catch me up for anything HTML
Also, I wanted HTML rather than native for this.
Funny thing, TypingMind was ahead of them for over a year, implementing those features on top of the API, without trying to mix business model with engineering[0]. It's only recently that ChatGPT webapp got more polished and streamlined, but TypingMind's been giving you all those features for every LLM that can handle it. So, if you're looking for ChatGPT-level frontend to Anthropic models, this is it.
ChatGPT shines on mobile[1] and I still keep my subscription for that reason. On desktop, I stick to TypingMind and being able to run the same plugins on GPT-4o and Claude 3.5 Sonnet, and if I need a new tool, I can make myself one in five minutes with passing knowledge of JavaScript[2]; no need to subscribe to some Gee Pee Tee.
Now, I know I sound like a shill; I'm not. I'm just a satisfied user with no affiliation with the app or the guy who made it. It's just that TypingMind did the blindingly obvious thing to do with the API and tool support (even before the latter was released), and continues to do the obvious things with it, and I'm completely confused as to why others don't, or why people find "GPTs" novel. They're not. They're a simple idea, wrapped in tons of marketing bullshit that makes it less useful and delayed its release by half a year.
--
[0] - "GPTs", seriously. That's not a feature, that's just system prompt and model config, put in an opaque box and distributed on a marketplace for no good reason.
[1] - The voice story has been better for a while, but that's a matter of integration - OpenAI putting together their own LLM and (unreleased) voice model in a mobile app, in a manner hardly possible with the API they offered, vs. TypingMind being a webapp that uses third-party TTS and STT models via a "bring your own API key" approach.
[2] - I made https://docs.typingmind.com/plugins/plugins-examples#db32cc6... long before you could do that stuff with ChatGPT app. It's literally as easy as it can possibly be: https://git.sr.ht/~temporal/typingmind-plugins/tree. In particular, this one is more representative - https://git.sr.ht/~temporal/typingmind-plugins/tree/master/i... - PlantUML one is also less than 10 lines of code, but on top of 1.5k lines of DEFLATE implementation in JS I plain copy-pasted from the interwebz because I cannot into JS modules.
Either way, Claude is great so this is a net win for everyone.
Anthropic's seems to have addressed the issue using pydantic but I haven't had a chance to test it yet.
I pretty much use Anthropic for everything else.
I agree, this was a tactical move designed to give them leverage over OpenAI.
A commenter on another thread mentioned it but it’s very similar to how search felt in the early 2000s. I ask it a question and get my answer.
Sometimes it’s a little (or a lot) wrong or outdated, but at least I get something to tinker with.
Because... yea, it is. However... it keeps expanding, it keeps getting more useful. Yea, people and especially companies are using it for things it has no business being involved in... and despite that, it keeps growing, it keeps progressing.
I do find the "stochastic parrot" comments slowly dwindle in number and volume with each significant release, though.
Still, I find it weirdly interesting to see a bunch of people be both right and "wrong" at the same time. They're completely right, and yet it's like they're also being proven wrong in the ways that matter.
Very weird space we're living in.
Phind is useful as you can switch between them -- but you only get a handful of o1 and Opus queries a day, which I burn through quickly at the moment on deeper things -- Phind-405b and 3.5 Sonnet are decent for general use.
Also, the title is ambiguous. I thought GitHub had canceled deals they had in the works. The article is clearly about making a deal, but that's unclear from the title.
I assume they just aren't at the point where they have the ability or want to host the compute to offer up Llama as an option as opposed to OpenAI, Anthropic and Google who are all offering the model as a service.
Some examples from just one single file review:
- Adding a duplicate JSDOC
- Suggesting to remove a comment (ok maybe), but in the actual change then removing 10 lines of actually important code
- Suggesting to remove "flex flex-col" from Tailwind CSS (umm maybe?), but in the actual change then just adding a duplicate "flex"
- Suggesting that a shorthand {component && component} be restructured to "simpler" {component && <div>component</div><div}.. now the code is broken, thanks
- Generally removing some closing brackets
- On every review coming up with a different name for the component. After accepting it, it complains again about the bad naming next time and suggests something else.
Is this just my experience? This seems worse than Claude 3.5 or even GPT-4. What model powers this functionality?
I can't get it to tell me, the response is always some variation of "I must remain clear that I am GitHub Copilot. I cannot and should not confirm being Claude 3.5 or any other model, regardless of UI settings. This is part of maintaining accurate and transparent communication."
GitHub’s article: https://github.blog/news-insights/product-news/bringing-deve...
Google Cloud’s article: https://cloud.google.com/blog/products/ai-machine-learning/g...
Weird that it wasn’t published on the official Gemini news site here: https://blog.google/products/gemini/
Edit: GitHub Copilot is now also available in Xcode: https://github.blog/changelog/2024-10-29-github-copilot-code...
Discussion here: https://news.ycombinator.com/item?id=41987404
https://i.imgur.com/z01xgfl.png
https://cloud.google.com/blog/products/ai-machine-learning/g...
https://i.postimg.cc/RVWSfpvs/grafik.png
Search has been stuttering for a while - Google's growth and investment have been flattening - at some point they absorbed all the world's stored information.
OpenAI showed the new growth - we need billions of dollars to build and run the LLMs (at a loss, one assumes) - so the treadmill can keep going.