I'm 100% an AI skeptic. I don't think we're near AGI, at least not by any reasonable definition of it. I don't think we're remotely near a superintelligence or the singularity or whatever. But that doesn't matter. These tools will change how work is done. What matters is what he says here:
> Peeps, let’s do some really simple back-of-envelope math. Trust me, it won’t be difficult math.
> You get the LLM to draft some code for you that’s 80% complete/correct.
> You tweak the last 20% by hand.
> How much of a productivity increase is that? Well jeepers, if you’re only doing 1/5th the work, then you are… punches buttons on calculator watch… five times as productive.
The hard part is the engineering, yes, and now we can actually focus on it. 90% of the work that software engineers do isn't novel. People aren't sitting around coding complicated algorithms that require PhDs. Most of the work is doing pretty boring stuff. And even if LLMs can only write 50% of that, you're still going to get a massive boost in productivity.
Of course this isn't always going to be great. Debugging this code is probably going to suck, and even with the best engineering it'll likely accelerate big balls of mud. But it's going to enable a lot of code to get written by a lot more people and it's super exciting.
> But it's going to enable a lot of code to get written by a lot more people and it's super exciting.
Is that exciting though? A lot of the code we've currently got is garbage. We are already slogging through a morass of code. I could be wrong, but I don't think LLMs are going to change this trend.
It is a GREAT time to slowly get into the consulting side of our industry. There will be so much mess that I think there will be troves of opportunity to come in on a contract and fix messes. There always was, but there will be a ton more!
> A lot of the code we've currently got is garbage.
That is why I have never understood the harsh criticism of the quality of LLM-generated code. What do people think much of these models' training data was? Garbage in -> Garbage out.
I think it's exciting that more people will be able to build more things. I agree though that most code is garbage, and that it's likely going to get worse. The future can be both concerning and exciting.
Personal anecdote: I pushed some LLM generated code into Prod during a call with the client today. It was simple stuff, just a string manipulation function, about 10 LOC. Absolutely could have done it by hand but it was really cool to not have to think about the details - I just gave some inputs and desired outputs and had the function in hand and tested it. Instead of saying "I'll reply when the code is done" I was able to say "it's done, go ahead and continue" and that was a cool moment for me.
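For illustration only, since the comment doesn't show the actual function: a hypothetical ~10-line string helper of the kind described, specified purely by a few input/output pairs.

```python
# Hypothetical stand-in for the string helper in the anecdote above;
# the real function isn't shown in the comment.
def mask_account_number(value: str, visible: int = 4) -> str:
    """Replace all but the last `visible` digits with '*', ignoring separators."""
    digits = value.replace(" ", "").replace("-", "")
    if len(digits) <= visible:
        return digits
    return "*" * (len(digits) - visible) + digits[-visible:]

# The "inputs and desired outputs" that would have gone into the prompt.
assert mask_account_number("1234-5678-9012-3456") == "************3456"
assert mask_account_number("42") == "42"
```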
That's the perfect application for LLM code! It is a well-defined problem, with known inputs and outputs, and probably many examples the LLM slurped up from open source code.
And that isn't a bad thing! Now we engineers need to focus on the high-level stuff, like what code to write in the first place.
> How much of a productivity increase is that? Well jeepers, if you’re only doing 1/5th the work, then you are… punches buttons on calculator watch… five times as productive.
To state the obvious -- only if all units of work are of equal difficulty.
The original calculation is not even right, because writing code probably isn't even 100% of the work software engineers do. A lot of the time they're debugging code or attending some kind of meeting. So if only 20% of their time is coding and the LLM does 4/5ths of that, they save what, 16% of their time, which works out to roughly 19% more productive overall?
Now factor in the expense of using LLMs and it's not worth it; the math isn't mathing. You can't even squeeze out a 2x improvement, let alone the 5-10x that has become expected in the tech industry.
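A quick sketch of that arithmetic (the numbers are the comment's, the helper is mine): apply the speedup only to the coding fraction of the job, Amdahl's-law style.

```python
# Back-of-envelope check of the comment's math: only the coding fraction of the
# job gets faster, so the overall gain is bounded by how much time coding takes.
def overall_speedup(coding_share: float, coding_speedup: float) -> float:
    remaining_time = (1 - coding_share) + coding_share / coding_speedup
    return 1 / remaining_time

print(overall_speedup(1.0, 5))  # 5.0   -- the article's implicit assumption
print(overall_speedup(0.2, 5))  # ~1.19 -- 16% of total time saved, ~19% more productive
```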
Also absent in these discussions is who maintains the 80% that wasn't written by them when it needs to either be fixed or adapted to a new use case. This is where I typically see "fast" AI prototypes fall over completely. Will that get better? I'm ambivalent, but I could concede 'probably'. At the end of the day, though, I don't think you can get away from the fact that for every non-trivial programming task there needs to be a human somewhere to verify/validate the work, or at the very least understand it enough to debug it. This seems like an inescapable fact.
My concern is that it's always harder to comprehend and debug code you didn't write, so there are hidden costs to using a copy-pasted solution even if you "get to the testing phase faster." Shaving percentages in one place could add different ones elsewhere.
Imagine if your manager announced a wonderful new productivity process: From now on a herd of excited interns will get the first stab at feature requests, and allll that the senior developers need to do is take that output and juuuust spruce it up a bit.
Then you have plenty of time for those TPS reports. (Now, later in my career, I have a lot more sympathy for Tom Smykowski, speaking with customers so the engineers don't have to.)
But wait, just hear me out for a second! What if we've also sufficiently lowered the definition of "GI" at the same time...
> Of course this isn't always going to be great. Debugging this code is probably going to suck, and even with the best engineering it'll likely accelerate big balls of mud. But it's going to enable a lot of code to get written by a lot more people and it's super exciting.
This is quite literally what junior engineers are.
I've worked at many places filled to the brim with junior engineers. They weren't enjoyable places to work. Those businesses did not succeed. They sure did produce a lot of code though.
I'd also be very suspicious of this Pareto distribution; to me it implies that the 20% of work we're talking about is the hard part. Spitting out code without understanding it definitely sounds like the easy part, the part that doesn't require much if any thinking. I'd be much more interested to see a breakdown of TIME, not volume of code; how much time does using an LLM save (or not) in the context of a given task?
The thing with junior engineers is that they may eventually become proper seniors and you'll play your role in it by working with them. Nobody learns anything from working with LLMs.
You can't become a senior and skip being a junior.
This isn't about the promise of AI to me; it's more about this particular type of AI technology, which should more accurately be called LLMs, since there are other kinds.
Skepticism aside, looking at what's already been possible with similar tools: much like Microsoft Word introduces more and more grammar-related features, LLMs are reasonably decent for the use case of feeding them your input and speeding up iterations with you in the driver's seat. Whether it's text in a word processor or code in an editor, there are likely meaningful improvements here, enough not to dismiss the technology.
The inverse of that, where someone wants a magic seed prompt to do everything, could prove to be a bit more challenging: existing models are limited to what they were trained on, and the steps to move beyond that, where they actually learn, appear to be a few years away.
> But it's going to enable a lot of code to get written by a lot more people and it's super exciting.
I think it will cut both ways. I guess it depends on the type of work. I feel like expectations will fill to meet the space of the increased productivity. At least, I can see this being true in a professional sense.
If I could be five times more productive, i.e., work 80% less, then I would be quite excited. However, I imagine most of us will still be working just as much -- if not more.
Maybe I am just jaded, but I can't help but be reminded of the old saying, "The reward for digging the biggest hole is a bigger shovel."
For personal endeavors, I am nothing but excited for this technology.
Yes, but there will be far fewer of us working.
Which, in a healthy society, would actually be a good thing.
> You get the LLM to draft some code for you that’s 80% complete/correct.
> You tweak the last 20% by hand.
> How much of a productivity increase is that? Well jeepers, if you’re only doing 1/5th the work, then you are… punches buttons on calculator watch… five times as productive
Except, it’s probably four times as hard to verify the 80% right code as it is to just write the code yourself in the first place.
Using AI is like trying to build a product with only extremely junior devs. You can get something working fairly quickly but sooner rather than later it's going to collapse in on itself and there won't be anyone around capable of fixing it. Moving past that requires someone with more experience to figure out what they did and eventually fix it, which ends up being far more expensive than correcting things along the way.
The error rate needs to come down significantly from where it currently is, and we'd also need AIs that cover each other's weaknesses - at the moment you can't just run another AI to fix the 20% of errors the first one generated, because they tend to have overlapping blind spots.
Having said all that, I think AI is hugely useful as a programming tool. Here are the places where I've gotten the most value:
- Better IDE auto-complete.
- Repetitive refactors that are beyond what's possible to do with a regex. For example, replacing one function with another whose usage is subtly different. LLMs can crack out those changes across a codebase and it's easy to review.
- Summarizing and extracting intent from dense technical content. I'm often implementing some part of a web standard, and being able to quickly ask an AI questions, and have it give references back to specific parts of the standard is super useful.
- Bootstrapping tests. The hardest part of writing tests in my experience is writing the first one - there's always a bit of boilerplate that's annoying to write. The AI is pretty good at generating a smoke test (a minimal sketch follows below) and then you can use that as a basis to do proper test coverage.
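As an illustration of that last bullet (made up, not from the thread): the kind of smoke test an LLM is good at bootstrapping, which you would then grow into proper coverage.

```python
# Illustrative only: a bootstrapped smoke test of the kind described above.
# `slugify` is a made-up function under test, not something from the thread.
import re

def slugify(title: str) -> str:
    """Lowercase a title and collapse runs of non-alphanumerics into dashes."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def test_slugify_smoke():
    # The boring first test that's annoying to write by hand.
    assert slugify("Hello, World!") == "hello-world"
    assert slugify("  already sluggish  ") == "already-sluggish"
    assert slugify("") == ""
```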
I'm sure there are more. The point is that it's currently incredibly valuable as a tool, but it's not yet capable of being an agent.
The problem is that now, when your colleagues/contributors push out shitty code that you have to fix... you don't know if they even tried or just put in bad code from an LLM.
Yep, this is a hell I've come to know over the past year. You get people trying to build things they really have no business building, not without at least doing their own research and learning first, but the LLM in their editor lets them spit out something that looks convincing enough to get shipped, unless someone with domain experience looks at it more closely and realizes it's all wrong. It's not the same as bodged-together code from Stack Overflow because the LLM code is better at getting past the sniff test, and magical thinking around generative AI leads people to put undue trust in it. I don't think we've collectively appreciated the negative impact this is going to have on software quality going forward.
> you don't know if they even tried or just put in bad code from an LLM
You don't need to. Focus on the work, not the person.
I recall a coworker screaming at a junior developer during a code review, and I set up a meeting later on to discuss his reaction. Why was he so upset? Because the junior developer was being lazy and not trying.
How did he know this?
He didn't. He used very dubious behavioral signals.
This was pre-LLM. As applicable now as it was then. If their code is shitty, let them know and leave it at that.
On the flip side, I often review code that I think has been written by an LLM. When the code is good, I don't really care if he wrote it or asked GPT to write it. As long as he's functioning well enough for the current job.
I've deployed code with Claude that was >100X faster, all-in (verification, testing, etc) than doing it myself. (I've been writing code professionally and successfully for >30 years).
>100X isn't typical, I'd guess typical in my case is >2X, but it's happened at large scale and saved me months of work in this single instance. (Years of work overall)
If you write code 100x faster, you could probably automate almost all of it away, as it seems to be super trivial, already-solved problems.
I also use Claude for my coding a lot, but I'm not sure about the real gains, or whether it even gives me a noticeable speed improvement, since I'm in a loop of "waiting" and then fixing bugs that the LLM made. Still, it is super useful for writing big docstrings and smaller tests. Maybe if I focus on some basic tasks like classical backend or frontend work it'll be more useful.
It's fascinating watching people in ostensibly leadership positions in our industry continue to fundamentally misunderstand how productivity works in knowledge work.
Totally. It used to be that LoC was widely accepted as a terrible metric to measure dev productivity by, but when talking about AI code-gen tools it's like it's all that matters.
This is pretty much it. I don't trust code produced by AI, tbh, because it almost always has logical errors, or it ignored part of the prompt so it's subtly wrong. I really like AI for prototyping and sketching, but then I usually take that prototype, maybe write some solid tests, and then write the real implementation.
If the code these LLMs produce has logical errors and bugs, I have formed the opinion that this is mostly user error. LLMs produce great code if they're used properly. By that I mean: strictly control the maximum level of complexity that you're asking the LLMs to solve.
GPT-4o can only handle rudimentary coding tasks. So I only ask it rudimentary questions. And it's rarely wrong or buggy. o1-2024-12-17 is much better, it can handle "mid" complexity tasks, and it's almost never wrong on those, but a step above and it will output buggy code.
This is subjective and not perfect, but when you use the models more you can get a sense for where it starts falling apart and stay below that threshold. This means chunking your problem into multiple prompts, where each chunk is nearing the maximum level of complexity that the model can handle.
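A rough sketch of that chunking idea; the comment doesn't show code, so the helper and prompts below are illustrative, and it assumes the OpenAI Python SDK with an API key in the environment.

```python
# Sketch of keeping each request below the model's complexity threshold and
# stitching the results together yourself, rather than one sprawling prompt.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def ask(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Two small, rudimentary asks instead of "build the whole feature":
dataclass_code = ask("gpt-4o", "Write a Python dataclass Invoice with id, total and due_date fields.")
parser_code = ask("gpt-4o", f"Given this dataclass, write a function that parses it from a CSV row:\n{dataclass_code}")
```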
I've built a pet project with LLMs recently: https://github.com/golergka/hn-blog-recommender. It's rough around the edges, with bugs, incomplete functionality and docs, but I've built it around 4 times as fast as I would have done it myself.
Basically 95% of the code is written by Aider. I still had to do QA, make high-level decisions, review code it wrote, assign tasks and tell it when to refactor stuff.
I would never use this workflow to write a program where absolute correctness is top priority, like a radiation therapy controller. But from this point on, I'll never waste time building pet projects, small web app and utilities myself.
Agreed. To add to this: The article is using a rather poor comparison because it makes the assumption that all sections of a program take an equal amount of time to write.
An LLM might be able to hammer out the unit tests, boilerplate react logic, and the express backend in 5 minutes, but if it collapses on the "20%" that you need actual assistance with, then the productivity boost is significantly less than suggested.
Yeah, this is almost literally the 80-20 rule[1]; often getting the first 80% of something done takes 20% of the work, and getting the last 20% takes 80% of the work. Debugging the last 20% is already the hard part for most things, and as you mention, it's going to be a lot harder debugging LLM-generated code.
[1]: https://en.wikipedia.org/wiki/Pareto_principle
I don't disagree, but here's another spin on what you observed.
Kernighan's law says you have to be twice as intelligent/skilled when debugging the code as you were when writing it.
An LLM-centric corollary would be: don't have the LLM write code at the boundary of its capabilities in your domain.
It's a little hard these days to know where that line lies, given the rate of improvements, but perhaps that's an intuition people are still developing with LLMs.
Verifying code is much more mental effort than writing it because you have to reconstruct the data structure and the logic of the code in order to show that the code you’re reading is correct.
You also absorbed no knowledge of the problem domain. I'm pretty positive on LLMs but with the caveat that the programmer still should be an active participant in the development of the solution.
> Except, it’s probably four times as hard to verify the 80% right code as it is to just write the code yourself in the first place.
Why would that be true in general? You think it just so happens that a new tool gives an X speedup in ability but costs an exactly equivalent X slowdown in other places?
No, the claim is just that it’s harder to read and verify existing code than to write new code. Depending on the level of certainty you want, verifying code is much more expensive than writing code, which is why there are so many bugs and why PR review is, imo, practically worthless.
Switching the job of programming from designing and writing code to reading and verifying code isn't a good trade-off in terms of making programming easier. And my experience so far with AI tools has done nothing to change my mind on this point.
The thing that everyone seems to miss here is it's not surprising which code falls into the 80% or 20%. I can guess if an AI will write something bug-free with almost 100% accuracy. For the remaining 20%, whatever, it gives me a fine enough starting point.
To debug code that doesn't work, you generally have to fully understand the goal and the code.
To write code that does work, you generally have to fully understand the goal and what code to use.
So unless you just flat-out don't know what code to use (ie. libraries, architecture, approach etc.) it's going to be no faster and probably less satisfying to fix LLM code as it would be to just do it yourself.
Where LLMs do shine in coding (in my limited experience) though is discovery. If you need to write something in a mostly unfamiliar language and ecosystem they can be great for getting off the ground. I'm still not convinced that the mental gear shifting between 'student asking questions' and 'researcher/analyst/coder mode' is worth the slightly more targeted responses vs. just looking up tutorials, but it's a more useful scenario.
I know how popular Steve Yegge is, and there is a bunch of stuff he's written in the past that I've enjoyed, but like, what is this? I know everyone says the Internet has ruined our attention spans, but I think it's fair to let the reader know what the point of your article is fairly early on.
As far as I can tell, this is a really unnecessarily long-winded article that is basically just "make sure you have good prompt context." And yes, when you're doing RAG, you need to make sure you pull the right data for your context.
It's an ad. The key point he works up to is "Put another way, you need a sidecar database. The data moat needs to be fast and queryable. This is a Search Problem!" And then of course this is hosted on the corporate blog of a code search platform. It's "make sure you have good prompt context" but trying to define that specifically as the thing they're selling.
Is this just a spam blog post for Sourcegraph?
This article really rubs me the wrong way. A company selling an AI product saying the following about AI sceptics:
> All you crazy MFs are completely overlooking the fact that software engineering exists as a discipline because you cannot EVER under any circumstances TRUST CODE.
is straight up insulting to me, because it effectively comes down to "use my product or you're a looney".
Also, while two years after this post (which should be labeled 2023) I've still barely tried to entirely offload coding to an LLM, the few times I did try have been pretty crap. I also really, really don't want to 'chat' with my codebase or editor. 'Chatting' to me feels about as slow as writing something myself, while I also don't get a good mental model 'for free'.
I am a moderately happy user of AI autocomplete (specifically Supermaven), but I only ever accept suggestions that are trivially correct to me. If it's not trivial, it might be useful as a guide of where to look in actual documentation if relevant, but just accepting it will lead me down a wrong path more often than not.
> How much of a productivity increase is that? Well jeepers, if you’re only doing 1/5th the work, then you are… punches buttons on calculator watch… five times as productive.
Cool, so I'm going to make five times more money? What's that you say, I'm not? So who is going to be harvesting the profits, then?
You will get paid 5x as much… assuming you are the only developer that uses this and the market for engineers is rational. However those other developers are going to be 5x as productive too so your rate should stay the same.
The company that pays you will get more code for less money and so they would make more profit if they are the only company taking advantage of this. But all companies will start to produce software at a lower cost so their profits per line of code will be squeezed to where they are now.
You will spend a lot less time writing boilerplate and searching docs or Stack Overflow. That frees you up to work on real problems, and since you'll be able to create products with less cost per line of code, you could potentially compete with your own offerings.
That's my take anyway.
It's a pretty good time to be a contractor, as long as you negotiate getting paid per project/milestone and not by the hour. What used to take three days now takes one, and you can use the extra time to book more clients.
Have any of these LLM coding assistant enthusiasts actually used them? They are ok at autocomplete and writing types. They cannot yet write "80% of your code" on anything shippable. They regularly miss small but crucial details. The more 'enthusiastic' ones, like Windsurf, subtly refactor code - removing or changing functionality which will have you scratching your head. Programming is a process of building mental models. LLMs are trying to replace something that is fundamental to engineering. Maybe they'll improve, but writing the code was never the hard part. I think the more likely outcome of LLMs is the broad lowering of standards than 5x productivity for all.
I'm trying to use Cursor for full-time building of applications for my personal needs ( years of experience in the field). It can be rough, but you see the glimpses of the future. In the early weeks, I am constantly having to think of new ways to minimize surprise: prompt-injection via dotfile (a collection of Dos and Don'ts), forcing the LLM to write itself jsdoc @instructions in problematic files, coming up with project overviews that state the goals and responsibilities of different parts. It's all very flaky, but there's a promise. I think it is a good time to try mastering the LLM as a helper, and to figure out how to play your best game using this ultra sharp tool.
My 2c is that an engineer will be needed in the loop to build anything meaningful. It's really hard to shape the play-dough of the codebase while having to give up so much control. I see other people who can't validate the LLM BS get themselves into a hole really quickly when using multi-file editing capabilities like the Composer feature.
One thing that kinda blew me away, even though it's embarrassingly predictable, is the "parallel edit" workflow, where the LLM can edit multiple files at the same time using a shared strategy. It almost never works currently for anything non-trivial, so the agent almost never uses it unless you ask it to (run the TypeScript check and use parallel edit to fix the bugs). It's also dumb, using the same auto-completion model that doesn't have access to Composer context, it feels like. But it shows how we can absorb the tax of waiting for the LLM to respond: by doing multiple things at once across many files.
And tests. Lots of tests. There's no way around it: the LLM can break things really subtly, and without tests on a big project that's impossible to fix. LLMs don't have the same fear of old files from previous ages; they'll happily start hacking on what's not broken. Cursor has a capable "code review" system with checkpoints and quick rollbacks. At this point in time we have to review most of what the LLM is spewing out. On the upside, the LLM can write decent tests, extract failing cases, etc. There's no excuse not to use tests. However, extra care needs to be taken with LLM-generated tests, to review whether their asserts are legitimate.
I tried to write an engine for a popular table game without personally understanding the rules, and I had a bunch of reference material. It was difficult when I couldn't validate the assertions. I think I failed the experiment and had to learn the game rules by debugging failing tests one by one. In many cases the culprit was a fixture generated by the LLM. We can use combined workflows where smarter models like o1-pro validate the fixtures; that typically works for well-known concepts and games.
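To illustrate the point above about reviewing the legitimacy of asserts (a made-up example, not from the thread): an LLM-drafted test can pass while quietly encoding the very bug it should catch.

```python
# Made-up example of why LLM-written asserts need review: the test below passes
# against the buggy implementation because the expected value was derived from
# the same wrong assumption, so it documents the bug instead of catching it.
def apply_discount(price: float, percent: float) -> float:
    return price - percent          # bug: subtracts a flat amount, not a percentage

def test_apply_discount():
    assert apply_discount(200.0, 10.0) == 190.0   # "looks right", but 10% off 200 is 180
```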
I use Cursor as well, totally understand what you mean about glimpses of the future. The composer and tab complete are good at writing isolated units of functionality - ie pure functions, especially when they are trivial and have been written before. I appreciate that it saves me a google search or time spent on rote work.
I do wonder how much more sophisticated LLMs would have to be to really excel at multi-file or cross-cutting whole-codebase editing. I have no way of really quantifying this, but I suspect it's a lot more than people think, and even then it might require a substantially different approach to making software. Kind of reminds me of the people trying to make full-length movies with video AI - cool trick, but who wants to sit through 1.5 hrs of that? The gulf between those videos and an actual feature-length film is _massive_, much like the difference between a prod codebase and a todo CRUD app. And it's not just about the level of detail, it's about coherence between small but important things and the overall structure.
I too use Cursor as my primary IDE for a variety of projects. This will sound cliche in the midst of the current AI hype, but the most critical component of getting good results with Cursor is properly managing the context and conversation length.
With large projects, I've found that a good baseline is about 300-500 lines of output, and around a dozen 500-line files as input. Inside those limits, Cursor seems to do pretty well, but if you overstuff the context window (even within the supposedly supported limits) Claude Sonnet 3.5 tends to forget important attributes of the input.
Another critical component is to keep conversations short. Don't reuse the same conversation for more than one task, and try to engineer prompts to minimize the amount/size of edits.
I definitely agree that Cursor isn't a silver bullet, but it's one of the best tools I've found for bootstrapping large projects as a solo dev.
That was a lot of words to describe... I think just "LLMs use context"? And the intro made me think he was mad about something, but I think he's actually excited? It's very confusing.
It's a Steve Yegge article. His writing is enjoyable, but it's a miracle he wrote an article short enough not to skim. He's excitable and verbose.
I will admit, I skimmed. No jury would convict.
Although he's talking up his product, it's an interesting premise. The idea is that cooking relevant context and parts of a codebase down into an LLM's context window is a useful product for coding assistants.
I think one important aspect of this is that many -- perhaps the bulk, now -- of people writing or talking on the Internet don't have much memory of a time when digital technology was underestimated -- or at least, did not get to their current positions by recognising when something was underestimated, and going with their own, overestimated beliefs. They probably got into technology because it was self-evidently a route to success, they were advised to do so, or they were as excited as anyone else by its overt promise. Their experience is dominated by becoming disillusioned with what they were sold on.
A different slice of time, and a different set of opportunities -- as Yegge says, the recollection that if you hadn't poo-pooed some prototype, you'd have $130 million based on an undeniably successful model -- gives you a different instinct.
I really do hear the -- utterly valid -- criticisms of the early Web and Internet when I hear people's skepticism of AI, including that it is overhyped. Overhyped or not, betting on the Internet in 1994 would have gotten you far, far, more of an insight (and an income) than observing all its early flaws and ignoring or decrying it.
This is somewhat orthogonal to whether those criticisms are correct!
To be fair to the skeptics, for every Amazon or Google, there were 10 other companies that were actually bad ideas or didn't survive the dot-com crash. It's also hard to say how much of the take here is apocryphal in terms of AWS being a plucky demo. It seems inevitable that someone would realize that with good enough bandwidth infrastructure, concentrating utility computing in data centers and increasing utilization of those resources through multi-tenancy is just a good idea that benefits both the capital owners and the renters. Research into distributed and shared computing had been around since the '80s; Amazon was just the first to realize the business potential at a time when hardware virtualization actually supported what they were trying to do.
> This is somewhat orthogonal to whether those criticisms are correct!
I think this is the most important take-away in evaluating the success of many technical businesses. At the end of the day, businesses succeed or fail due to lots of factors that have nothing to do with the soundness of the engineering or R&D that is at the core of the value proposition. This is an extremely frustrating fact for technically minded folks (I count myself among them).
Reminds me of the quote about how stock markets are like betting on the outcome of a beauty contest. The humans who control large purchasing decisions are likely much less technical than the average audience of this blog. The fact that this is frustrating doesn't make it any less true or important to internalize.