One thing I find frustrating is that management where I work has heard of 10x productivity gains. Some of those claims even come from early adopters at my work.
But that sets expectations way too high. Partly it is due to Amdahl's law: I spend only a portion of my time coding, and far more time thinking and communicating with others who are customers of my code. Even if it does make the coding 10x faster (and it doesn't most of the time), overall my productivity is 10-15% better. That is nothing to sneeze at, but it isn't 10x.
Maybe it's due to a more R&D-ish nature of my current work, but for me, LLMs are delivering just as much gains in the "thinking" part as in "coding" part (I handle the "communicating" thing myself just fine for now). Using LLMs for "thinking" tasks feels similar to how mastering web search 2+ decades ago felt. Search engines enabled access to information provided you know what you're looking for; now LLMs boost that by helping you figure out what you're looking for in the first place (and then conveniently searching it for you, too). This makes trivial some tasks I previously classified as hard due to effort and uncertainty involved.
At this point I'd say about 1/3 of my web searches are done through ChatGPT o3, and I can't imagine giving it up now.
(There's also a whole psychological angle in how having an LLM help sort and rubber-duck your half-baked thoughts makes many tasks seem much less daunting, and that alone makes a big difference.)
This, plus a voice mode (e.g. ChatGPT's Advanced Voice Mode), makes it perfect for brainstorming.
Once I decide I want to "think a problem through with an LLM", I often start with just the voice mode. This forces me to say things out loud — which is remarkably effective (see: rubber duck debugging) — and it also gives me a fundamentally different way of consuming the information the LLM provides. Instead of being delivered a massive wall of text, some of which could be wrong, I get a sequential conversation where I can stop, pause, or redirect the LLM as soon as something makes me curious or as I find problems with what it said.
You would think this way of interacting would be limiting, since having a fast LLM output large chunks of text would let you skim it and commit it to memory faster. Yet, for me, the combination of hearing things and, most of all, not having to consume so much potentially wrong info (what good is skimming pointless stuff?) makes ChatGPT's Advanced Voice mode a great way to initially approach a problem.
After the first round with the voice mode is done, I often move to written-form brainstorming.
From time to time I use an LLM to pretend to research a topic that I had researched recently, to check how much time it would have saved me.
So far, most of the time, my impression was "I would have been so badly misled and wouldn't even have known it until too late". It would have saved me negative time.
The only thing LLMs can consistently help me with so far is typing out mindless boilerplate, and yet it still sometimes requires manual fixing (but I do admit that it still does save effort). Anything else is hit or miss. The kind of stuff it does help researching with is usually the stuff that's easy to research without it anyway. It can sometimes shine with a gold nugget among all the mud it produces, but it's rare. The best thing is being able to describe something and ask what it's called, so you can then search for it in traditional ways.
That said, search engines have gotten significantly worse for research in the last decade or so, so the bar is lower for LLMs to be useful.
> One thing I find frustrating is that management where I work has heard of 10x productivity gains. Some of those claims even come from early adopters at my work.
Similar situation at my work, but all of the productivity claims from internal early adopters I've seen so far are based on very narrow ways of measuring productivity, and very sketchy math, to put it mildly.
> One thing I find frustrating is that management where I work has heard of 10x productivity gains.
That may also be in part because LLMs are not as big of an accelerant for junior devs as they are for seniors (juniors don't know what's good and bad as well).
So if you give 1 senior dev a souped-up LLM workflow, I wouldn't be too surprised if they are as productive as 10 pre-LLM juniors. Maybe even more, because a bad dev can actually produce negative productivity (stealing time from the senior), in which case it's infinity-x.
Even a decent junior is mostly limited to doing the low-level grunt work, which LLMs can already do better.
Point is, I can see how jobs could be lost, legitimately.
The thing lost in all of this, though, is the pipeline of talent.
Precision machining is going through an absolute nightmare where the journeymen or master machinists are aging out of the work force. These were people who originally learned on manual machines, and upgraded to CNC over the years. The pipeline collapsed about 1997.
Now there are no apprentice machinists to replace the skills of the retiring workforce.
This will happen to software developers. Probably faster because they tend to be financially independent WAY sooner than machinists.
> overall my productivity is 10-15% better. That is nothing to sneeze at, but it isn't 10x.
It is something to sneeze at if you are 10-15% more expensive to employ due to the cost of the LLM tools. The total cost of production should always be considered, not just throughput.
It's just another tech hype wave. Reality will be somewhere between total doom and boundless utopia. But probably neither of those.
The AI thing kind of reminds me of the big push to outsource software engineers in the early 2000's. There was a ton of hype among executives about it, and it all seemed plausible on paper. But most of those initiatives ended up being huge failures, and nearly all of those jobs came back to the US.
People tend to ignore a lot of the little things that glue it all together that software engineers do. AI lacks a lot of this. Foreigners don't necessarily lack it, but language barriers, time zone differences, cultural differences, and all sorts of other things led to similar issues. Code quality and maintainability took a nosedive and a lot of the stuff produced by those outsourced shops had to be thrown in the trash.
I can already see the AI slop accumulating in the codebases I work in. It's super hard to spot a lot of these things that manage to slip through code review, because they tend to look reasonable when you're looking at a diff. The problem is all the redundant code that you're not seeing, and the weird abstractions that make no sense at all when you look at it from a higher level.
This was what I was saying to a friend the other day. I think anyone vaguely competent that is using LLMs will make the technology look far better than it is.
Management thinks the LLM is doing most of the work. Work is off shored. Oh, the quality sucks when someone without a clue is driving. We need to hire again.
On my personal projects it's easily 10x faster if not more in some circumstances.
At work, where things are planned out months in advance and I'm working with 5 different teams to figure out the right way to do things for requirements that change 8 times during development? Even just stuff like PR review and making sure other people understand it and can access it. idk, sometimes it's probably break even, or that 10-15%.
It just doesn't work well in some environments and what really makes it flourish (having super high quality architectural planning/designs/standardized patterns etc.) is basically just not viable at anything but the smallest startups and solo projects.
Frankly, even just getting engineers to agree on those super-specific standardized patterns is asking a ton, especially since lots of the things that help AI out are not what they are used to. As soon as you have stuff that starts deviating, it can confuse the AI and makes that 10x no longer accessible. Also, no one would want to review the PRs I'd make for the changes I do on my "10x" local project... Maintaining those standards is already hard enough on my side projects: AI will naturally deviate and create noise, and the challenge is constructing systems to guide it so that nothing deviates (since noise leads to more noise).
I think it's mostly a rebalancing thing: if you have 1 or a couple of like-minded engineers who intend to do it, they can get that 10x. I do not see that EVER existing in any actual corporate environment, or even once you get more than like 4 people, tbh.
AI for middle management and project planning, on the other hand...
I don't disagree with your assessment of the world today, but just 12 months ago (before the current crop of base models and coding agents like Claude Code), even that 10X improvement of writing some-of-the-code wouldn't have been true.
> just 12 months ago (before the current crop of base models and coding agents like Claude Code), even that 10X improvement of writing some-of-the-code wouldn't have been true.
You had to paste more into your prompts back then to make the output work with the rest of your codebase, because there weren't good IDEs/"agents" for it, but you've been able to get really, really good code for 90% of "most" day-to-day SWE work since at least when OpenAI released the GPT-4 API, which was a couple of years ago.
Today it's a lot easier to demo low-effort "make a whole new feature or prototype" things than doing the work to make the right API calls back then, but most day to day work isn't "one shot a new prototype web app" and probably won't ever be.
I'm personally more productive now than 1 or 2 years ago, because back then building the prompts took longer than just writing the code myself for a lot of things in my domain - but hardly 10x. It usually one-shots stuff wrong, and then there's a good chance that it'll take longer to chase down the errors than it would've taken to just write the thing - or only use it as "better autocomplete" - in the first place.
> I don't disagree with your assessment of the world today, but just 12 months ago (before the current crop of base models and coding agents like Claude Code), even that 10X improvement of writing some-of-the-code wouldn't have been true.
So? It sounds like you're prodding us to make an extrapolation fallacy (I don't even grant the "10x in 12 months" point, but let's just accept the premise for the sake of argument).
Honestly, 12 months ago the base models weren't substantially worse than they are right now. Some people will argue with me endlessly on this point, and maybe they're a bit better on the margin, but I think it's pretty much true. When I look at the improvements of the last year with a cold, rational eye, they've been in two major areas:
* cost & efficiency
* UI & integration
So how do we improve from here? Cost & efficiency are the obvious lever with historical precedent: GPUs kinda suck for inference, and costs are (currently) rapidly dropping. But maybe this won't continue -- algorithmic complexity is what it is, and barring some revolutionary change in the architecture, transformer inference still scales quadratically with context length.
UI and integration is where most of the rest of the recent improvement has come from, and honestly, this is pretty close to saturation. All of the various AI products already look the same, and I'm certain that they'll continue to converge on a well-accepted local maximum. After that, huge gains in productivity from UX alone will not be possible. This will happen quickly -- probably in the next year or two.
Basically, unless we see a Moore's law of GPUs, I wouldn't bet on indefinite exponential improvement in AI. My bet is that, from here out, this looks like the adoption curve of any prior technology shift (e.g. mainframe -> PC, PC -> laptop, mobile, etc.) where there's a big boom, then a long, slow adoption for the masses.
It's great when they use AI to write a small app “without coding at all” over the weekend and then come in on Monday to brag about it and act baffled that tasks take engineers any time at all.
The reports from analyses of open source projects are that it's something in the range of 10%-15% productivity gains... so it sounds like you're spot on.
How much of the communication and meetings exist because traditionally code was very expensive and slow to create? How many of those meetings might be streamlined or disappear entirely in the future? In my experience there is a lot of process around making sure that software stays on schedule and that it's doing what it is supposed to do. I think the software lifecycle is about to be reinvented.
AI is the new uplift. Embrace and adapt, as a rift is forming in what employers seek in terms of skills from employees (see my talk at https://ghuntley.com/six-month-recap/).
I'm happy to answer any questions folks may have. Currently AFK [2] vibecoding a brand new programming language [1].
I’m a tech lead and I have maybe 5x the output now compared to everybody else under me, quantified by scoring tickets at a team level. I also have more responsibilities outside of IC work compared to the people under me. At this point I’m asking my manager to fire people who still think LLMs are just toys, because I’m tired of working with people with this poor mindset. A pragmatic engineer continually reevaluates what they think they know. We are at a tipping point now. I’m done arguing with people who have a poor model of reality. The rest of us are trying to compete and get shit done. This isn’t an opinion or a game. It’s business, with real-life consequences if you fall behind. I’ve offered to share my workflows, prompts, and setup. Guess how many of these engineers have taken me up on my offer: 1-2, and the juniors or the ones who are very far behind have not.
It’s funny. We fired someone with this attitude Thursday. And by this attitude I mean yours.
Not necessarily because of their attitude, but because it turned out the software they were shipping was rife with security issues. Security managed to quickly detect and handle the resulting incident. I can’t say his team were sad to see him go.
Are you the one at Ableton responsible for it ignoring the renaming of parameter names during the setState part of a Live program? Some of us are already jumping through ridiculous hoops to cover for your… mindset. There's stuff coming up that used to work and doesn't now, like in Live 12. From your response I would guess this is a trend that will hold.
We should not be having to code special 'host is Ableton Live' cases in JUCE just to get your host to work like the others.
Can you please not fire any people who are still holding your operation together?
I have to say I’m in the exact camp the author is complaining about. I’ve shipped non-trivial greenfield products which I started back when it was only ChatGPT, and it was shitty. I started using Claude, copying and pasting back and forth between the web chat and Xcode. Then I discovered Cursor. It left me with a lot of annoying build errors, but my productivity was still at least 3x. Now that agents are better and Claude 4 is out, I barely ever write code, and I don’t mind. I’ve leaned into the Architect/Manager role and direct the agent with my specialized knowledge when I need to.
I started a job at a demanding startup and it’s been several months and I have still not written a single line of code by hand. I audit everything myself before making PRs and test rigorously, but Cursor + Sonnet is just insane with their codebase. I’m convinced I’m their most productive employee, and that’s not by measuring lines of code, which don’t matter; people who are experts in the codebase ask me for help with niche bugs I can narrow in on in 5-30 minutes as someone who’s fresh to their domain. I had to lay off taking work away from the front-end dev (which I’ve avoided my whole career) because I was stepping on his toes, fixing little problems as I saw them thanks to Claude. It’s not vibe coding - there’s a process of research and planning and proceeding in careful steps, and I set the agent up for success. Domain knowledge is necessary. But I’m just so floored how anyone could not be extracting the same utility from it. It feels like there are two articles like this every week now.
Look, the person who wrote that comment doesn't need to prove anything to you just because you're hopped up after reading a blog post that has clearly given you a temporary dopamine bump.
People who understand their domains well and are excellent written communicators can craft prompts that will do what we used to spend a week spinning up. It's self-evident to anyone in that situation, and the only thing we see when people demand "evidence" is that you aren't using the tools properly.
We don't need to prove anything because if you are working on interesting problems, even the most skeptical person will prove it to themselves in a few hours.
Same experience here, probably in a slightly different way of work (PhD student). Was extremely skeptical of LLMs, Claude Code has completely transformed the way I work.
It doesn't take away the requirements of _curation_ - that remains firmly in my camp (partially what a PhD is supposed to teach you! to be precise and reflective about why you are doing X, what do you hope to show with Y, etc -- breakdown every single step, explain those steps to someone else -- this is a tremendous soft skill, and it's even more important now because these agents do not have persistent world models / immediately forget the goal of a sequence of interactions, even with clever compaction).
If I'm on my game with precise communication, I can use CC to organize computation in a way which has never been possible before.
It's not easier than programming (if you care about quality!), but it is different, and it comes with different idioms.
I find that the code quality LLMs output is pretty bad. I end up going through so many iterations that it ends up being faster to do it myself. What I find agents actually useful for is doing large-scale mechanical refactors. Instead of trying to figure out the perfect vim macro or AST rewrite script, I'll throw an agent at it.
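For the curious, the kind of AST rewrite script an agent replaces here is short but fiddly to get right by hand. A minimal sketch using Python's stdlib `ast` (the function names are invented for illustration):

```python
import ast

class RenameCall(ast.NodeTransformer):
    """Rewrite every call to `old_name(...)` as `new_name(...)`."""
    def __init__(self, old_name, new_name):
        self.old_name = old_name
        self.new_name = new_name

    def visit_Call(self, node):
        self.generic_visit(node)  # rewrite nested calls too
        if isinstance(node.func, ast.Name) and node.func.id == self.old_name:
            node.func = ast.Name(id=self.new_name, ctx=ast.Load())
        return node

source = "total = fetch_rows(db) + fetch_rows(cache)"
tree = RenameCall("fetch_rows", "fetch_rows_v2").visit(ast.parse(source))
print(ast.unparse(tree))  # total = fetch_rows_v2(db) + fetch_rows_v2(cache)
```

The trade-off the commenter describes: the script above is precise and repeatable, but an agent gets you roughly there without having to recall the `NodeTransformer` API at all.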
I disagree strongly at this point. The code is generally good if the prompt was reasonable, but also every test possible is now being written, every UI element has all the required traits, every function has the correct documentation attached, the million little refactors to improve the codebase are being done, etc.
Someone told me ‘AI makes all the little things trivial to do’ and I agree strongly with that. Those many little things together make a strong statement about quality. Our codebase has gone up in quality significantly with AI, whereas we’d let the little things slide due to understaffing before.
The auditing is not quick. I prefer Cursor to Claude Code because I can more easily review its changes while it’s going, and stop and redirect it if it starts to veer off course (which is often, but that’s the cost of doing business). Over time I still gain an understanding of the codebase that I can use to inform my prompts or redirection, so it’s not like I’m blindly asking it to do things. Yes, I do ask it to write unit tests a lot of the time. But I don’t have it spin off and just iterate until the unit tests pass — that’s a recipe for it to do whatever it needs to do to pass them, and is counterproductive. I plan what I want the set of tests to look like and have it write functions in isolation without mentioning tests, and if tests fail I go through a process of auditing the failing code and then the tests themselves to make sure nothing was missed. It’s exactly how I would treat a coworker’s code that I review. My prompts range from a few sentences to a few paragraphs, and nowadays I construct a large .md file with a checklist that we iterate on for larger refactors and projects to manage context.
Please re-read the article. Especially the first list of things we don't know about you, your projects etc.
Your specific experience cannot be generalized. And I'm speaking as the author, who is (as written in the article) literally using these tools every day.
> But I’m just so floored how anyone could not be extracting the same utility from it. It feels like there’s two articles like this every week now.
This is where we learn that you haven't actually read the article. Because it is very clearly stating, with links, that I am extracting value from these tools.
And the article is also very clearly not about extracting or not extracting value.
I did read the entire article before commenting, and I acknowledge that you are using them to some effect, but the line about ‘50% of the time it works 50% of the time’ is where I lost faith in the claims you’re making. I agree it’s very context dependent, but, in the same way, you did not outline your approaches and practices in how you use AI in your workflow. The same lack of context exists on the other side of the argument.
It’s not. It’s like I used to play baseball professionally and now I’m a coach or GM building teams and yielding results. It’s a different set of skills. I’m working mostly in idea space and seeing my ideas come to life with a faster feedback loop and the toil is mostly gone
Otherwise, 99% of my code these days is LLM generated, there's a fair amount of visible commits from my opensource on my profile https://github.com/wesen .
A lot of it is more on the system side of things, although there are a fair amount of one-off webapps, now that I can do frontends that don't suck.
I’d like to, but purposefully am using a throwaway account. It’s an iOS app rated 4.5 stars on the app store and has a nice community. Mild userbase, in the hundreds.
Mean time to shipping features of various estimated difficulty. It’s subjective and not perfect, but generally speaking I need to work way less. I’ll be honest, one thing I think I could have done faster without AI was implementing CRDT-based cloud sync for a project I have going. I think I tried to utilize AI too much for this. It’s good at implementing vector clocks, but not at preventing race conditions.
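For context on why that split happens: the vector-clock half of CRDT sync really is the mechanical part. A minimal sketch (my own illustration, not the commenter's code):

```python
def vc_merge(a, b):
    """Pointwise max of two vector clocks, given as {replica_id: counter}."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

def vc_compare(a, b):
    """Return 'equal', 'before', 'after', or 'concurrent'."""
    keys = a.keys() | b.keys()
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"  # neither dominates: this is where the hard part starts

# Two replicas that advanced independently are concurrent:
print(vc_compare({"a": 2, "b": 1}, {"a": 1, "b": 2}))  # concurrent
```

Everything above is textbook. The race conditions live in how you apply merges atomically while new local edits keep arriving, which matches the commenter's experience of where the LLM stopped being helpful.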
> there’s a process of research and planning and perusing in careful steps, and I set the agent up for success
Are there any good articles you can share or maybe your process? I’m really trying to get good at this but I don’t find myself great at using agents and I honestly don’t know where to start. I’ve tried the memory bank in cline, tried using more thinking directives, but I find I can’t get it to do complex things and it ends up being a time sink for me.
More anecdata: +1 for “LLMs write all my production code now”. 25+ years in industry, as expert as it’s possible to be in my domain. 100% agree LLMs fail hilariously badly, often, and dangerously. And still, write ~all my code.
No agenda here, not selling anything. Just sitting here towards the later part of my career, no need to prove anything to anyone, stating the view from a grey beard.
Crypto hype was shills from grifters pumping whatever bag-holding scam they could, which was precisely what the behavioral economic incentives drove. GenAI dev is something else. I’ve watched many people working with it; your mileage will vary. But in my opinion (and it’s mine, you do you), hand coding is becoming an anachronistic skill. The only part I wonder about is how far up and down the system/design/architecture stack the power-tooling is going to go. My intuition and empirical findings incline towards a direction I think would fuel a flame war. But I’m just a grey-beard Internet random, and hey look, no evidence, just more baseless claims. Nothing to see here.
Disclosure: I hold no direct shares in Mag 7, nor do I work for one.
_So much_ work in the 'services' industries globally comes down to really a human transposing data from one Excel sheet to another (or from a CRM/emails to Excel), manually. Every (or nearly every) enterprise scale company will have hundreds if not thousands of FTEs doing this kind of work day in day out - often with a lot of it outsourced. I would guess that for every 1 software engineer there are 100 people doing this kind of 'manual data pipelining'.
So really, for giant value to be created out of LLMs you do not need them to be incredible at OCaml. They just need to ~outperform humans on Excel. Where I do think MCP really helps is that you can connect all these systems together easily; a lot of the errors in this kind of work came from trying to pass the entire 'task' in context. If you can take an email via MCP, extract some data out, and put it into a CRM (again via MCP) a row at a time, the hallucination rate is very low IME. I'd say at least at the level of an overworked junior human.
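That row-at-a-time pattern can be sketched roughly like this (a hypothetical illustration; `llm_extract` and `crm_insert` are stand-ins for whatever MCP tools you wire up, not real APIs):

```python
# Hypothetical stand-ins: `llm_extract` wraps a model call on ONE email,
# `crm_insert` writes ONE row (e.g. via an MCP tool). Keeping the unit of
# work this small is the point of the pattern.
REQUIRED_FIELDS = {"name", "email", "amount"}

def pipe_one(email_text, llm_extract, crm_insert):
    """Extract one record from one email, validate it, write one CRM row."""
    row = llm_extract(email_text)  # context = a single email, not the whole task
    missing = REQUIRED_FIELDS - row.keys()
    if missing:
        # Fail loudly instead of writing a partial row someone has to untangle.
        raise ValueError(f"refusing to write partial row; missing {sorted(missing)}")
    crm_insert(row)
    return row
```

The quality-control processes mentioned below then attach naturally at the validation boundary, much as they would for a human operator.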
Perhaps this was the point of the article, but non-determinism is not an issue for these kind of use cases, given all the humans involved are not deterministic either. We can build systems and processes to help enforce quality on non deterministic (eg: human) systems.
Finally, I've followed crypto closely and also LLMs closely. They do not seem to be similar in terms of utility and adoption. The closest thing I can recall is smartphone adoption. A lot of my non technical friends didn't think/want a smartphone when the iPhone first came out. Within a few years, all of them have them. Similar with LLMs. Virtually all of my non technical friends use it now for incredibly varied use cases.
Making a comparison to crypto is lazy criticism. It’s not even worth validating. It’s people who want to take the negative vibe from crypto and repurpose it. The two technologies have nothing to do with each other, and therefore there’s clearly no reason to make comparative technical assessments between them.
That said, the social response is a trend of tech worship that I suspect many engineers who have been around the block are weary of. It’s easy to find unrealistic claims, the worst coming from the CEOs of AI companies.
At the same time, a LOT of people are practically computer illiterate. I can only imagine how exciting it must seem to people who have very limited exposure to even basic automation. And the whole “talking computer” we’ve all become accustomed to seeing in science fiction is pretty much becoming reality.
There’s a world of takes in there. It’s wild.
I worked in ML and NLP for several years before the current AI wave. What’s most striking to me is that this is way more mainstream than anything that has ever happened in the field. And with that comes a lot of inexperience in designing with statistical inference. It’s going to be the Wild West for a while — in opinions, in successful implementation, in learning how to form realistic project ideas.
Look at it this way: now your friend with a novel app idea can be told to do it themselves. That’s at least a win for everyone.
> Look at it this way: now your friend with a novel app idea can be told to do it themselves. That’s at least a win for everyone.
For now, anyways. Thing is, that friend now also has a reasonable shot at succeeding in doing it themselves. It'll take some more time for people to fully internalize it. But let's not forget that there's a chunk of this industry that's basically building apps for people with "novel app ideas" that have some money but run out of friends to pester. LLMs are going to eat a chunk out of that business quite soon.
Each FTE doing that manual data pipelining work is also validating that work, and they have a quasi-legal responsibility to do their job correctly and on time. They may have substantial emotional investment in the company, whether survival instinct to not be fired, or ambition to overperform, or ethics and sense to report a rogue manager through alternate channels.
An LLM won't call other nodes in the organization to check when it sees that a value is unreasonable for some out-of-context reason, like yesterday was a one-time-only bank holiday and so the value should be 0. *It can absolutely be worth an FTE salary to make sure these numbers are accurate.* And for there to be a person to blame/fire/imprison if they aren't.
People are also incredibly accurate at doing this kind of manual data piping all day.
There is also a reason these jobs are not already automated. For many of them you don't need language models; we could have automated them already, but it wasn't worth it for anyone to sign off on. I have been in this situation at a bank: I could have automated a process rather easily, but the upside for me was a smaller team and no real gain, while the downside was getting fired for a massive automated mistake if something went wrong.
> An LLM won't call other nodes in the organization to check when it sees that the value is unreasonable for some out-of-context reason, like yesterday was a one-time-only bank holiday and so the value should be 0.
Why not? LLMs are the first kind of technology that can take this kind of global view. We're not making much use of it in this way just yet, but considering "out-of-context reasons" and taking a wider perspective is pretty much the defining aspect of LLMs as general-purpose AI tools. In time, I expect them to match humans on this (at least humans that care; it's not hard to match those who don't).
I do agree on the liability angle. This increasingly seems to be the main value a human brings to the table. It's not a new trend, though. See e.g. medicine, architecture, civil engineering - licensed professionals aren't doing the bulk of the work, but they're in the loop and well-compensated for verifying and signing off on the work done by less-paid technicians.
You are correct that review and validation should still be manual. But the actual "translation" from one format to another should be automated with LLMs.
>I would guess that for every 1 software engineer there are 100 people doing this kind of 'manual data pipelining'.
For what type of company is this true? I really would like someone to just do a census of 500 white-collar jobs and categorize them all. Anything that is truly automatic has already been automated away.
I do think AI will cause a lot of disruption, but very skeptical of the view that most people with white collar jobs are just "email jobs" or data entry. That doesn't fit my experience at all, and I've worked at some large bureaucratic companies that people here would claim are stuck in the past.
I'm a retired programmer. I can't imagine trusting code generated by probabilities for anything mission critical. If it were close and just needed minor tweaks, I could understand that. But I don't have experience with it.
My comment is mainly to say LLMs are amazing in areas that are not coding, like brainstorming, blue sky thinking, filling in research details, asking questions that make me reflect. I treat the LLM like a thinking partner. It does make mistakes, but those can be caught easily by checking other sources, or even having another LLM review the conclusions.
Well; I can't speak to your specific experience (current or past) but I'm telling you that while I'm skeptical as hell about EVERYTHING, it's blowing my expectations away in every conceivable way.
I built something in less than 24h that I'm sure would have taken us MONTHS to just get off the ground, let alone to the polished version it's at right now. It's impressive that it can do all of the things that I absolutely can do, just faster. But the most impressive thing is that it can do all the things I cannot possibly do and would have had to hire up/contract out to accomplish--for far less money, less time, and with faster iterations than if I had to communicate with another human being.
It's not perfect and it's incredibly frustrating at times (hardcoding values into the code when I have explicitly told it not to; outright lying that it made a particular fix, when it actually changed something else entirely unrelated), but it is a game changer IMO.
> I built something in less than 24h that I'm sure would have taken us MONTHS to just get off the ground, let alone to the polished version it's at right now
See, your comment is a good example of what's going wrong. The OP specifically mentioned "mission critical things" - my interpretation of that would be things that are not allowed to break, because otherwise people might die, in the worst case - and you were talking about just SOMETHING that got "done" faster. No mention of anything critical.
Of course, I was playing around with Claude Code too, and I was fascinated by how fun it can be, and yes, you can get stuff done. But I have absolutely no clue what the code is doing and whether there are some nasty mistakes. So it kinda worked, but I would not use that for anything "mission critical" (whatever this means).
I tried the "thinking partner" approach for a while and for a moment I thought it worked well, but at some point the cracks started to show and I called the bluff. LLMs are extremely good at creating an illusion that they know things and are capable of reasoning, but they really don't do a good job of cultivating intellectual conversation.
I think it's dangerously easy to get misled when trying to prod LLMs for knowledge, especially if it's a field you're new to. If you were using a regular search engine, you could look at the source website to determine the trustworthiness of its contents, but LLMs don't have that. The output can really be whatever, and I don't agree it's necessarily that easy to catch the mistakes.
This is very model-dependent. If you use something heavy on sycophancy and low on brain cells (like GPT-4o, the default paid ChatGPT model), you'll get lots and lots of cracks because these models are optimised for engagement.
That said, don't use model output directly. Use it to extract "shibboleth" keywords and acronyms in that domain, then search those up yourself with a classical search engine (or in a follow-up LLM query). You'll access a lot of new information that way, simply because you know how to surface it now.
You don't say what LLM you are using. I'm using ChatGPT 4o. I'm getting great results, but I review the output with a skeptical eye similar to how I read Wikipedia articles. Like Wikipedia, GPT 4o is great for surfacing new topics for research and does it quickly, which makes stream of thought easier.
Mostly agree. What works better is seeing the AI as a "high level to low level converter", in the same way Java is converted to machine code when it's run. You describe exactly what you want it to report or do, and steer it whenever there are ambiguities. It does "grunt work" for you, with the bar of what grunt work means being moved up. Grunt work used to be doing the dishes or calculating numbers by hand on a paper spreadsheet. Decades ago we automated those. Now we've automated searching for information, summarization, implementing fully specified technical designs for software; the list goes on.
I've been programming for 40 years and started using LLMs a few months ago, and it has really changed the way I work. I let it write pieces of code (pasting error messages from logs mostly results in a fix in less than a minute), but I also brainstorm with it about architecture or new solutions. Of course I check the code it writes, but I'm still almost daily amazed at the intelligence and accuracy. (Very much unlike crypto.)
All code, including stuff that we experienced coders write is inherently probabilistic. That’s why we have code reviews, unit tests, pair programming, guidelines and guardrails in any critical project. If you’re using LLM output uncritically, you’re doing it wrong, but if you’re using _human_ output uncritically you’re doing it wrong too.
That said, they are not magic, and my fear is that people use copilots and agentic models and all the rest to hide poor engineering practice, building more and more boilerplate instead of refactoring or redesigning for efficiency or safety or any of the things that matter in the long run.
There's one thing I find LLMs extremely good at: data science. Since the IO is well defined, you can easily verify that the output is correct. You can even ask it to write tests for you, given that you know certain properties of the data.
The problem is that the LLM needs context about what you are doing, context that you won't (or are too lazy to) give in a chat with it à la ChatGPT. This is where Claude Code changes the game.
For example, say you have a PCAP file where each UDP packet contains multiple messages.
How do you filter the IP/port/protocol/time? Use LLM, check the output
How do you find the number of packets that have patterns A, AB, AAB, ABB.... Use LLM, check the output
How to create PCAPs that only contain those packets for testing? Use LLM, check the output
Etc etc
Since it can read your code, it is able to infer (because let's be honest, your work ain't special) what you are trying to do at a much better rate. In any case, the fact that you can simply ask "Please write a unit test for all of the above functions" means that you can help it verify itself.
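As a concrete (and entirely hypothetical) sketch of the pattern-counting step above: suppose each packet has already been decoded into a list of message types. The function name and encoding are illustrative, not from the original comment, but this is the kind of easily verifiable helper you'd ask the LLM to write and then test.

```python
# Hypothetical sketch: count packets whose message sequence matches a pattern.
from collections import Counter

def count_patterns(packets, patterns):
    """Count packets whose full message sequence equals one of the patterns."""
    counts = Counter()
    for messages in packets:
        sequence = "".join(messages)  # e.g. ["A", "B"] -> "AB"
        if sequence in patterns:
            counts[sequence] += 1
    return dict(counts)

# A packet with messages A then B matches the "AB" pattern, and so on.
packets = [["A"], ["A", "B"], ["A", "B"], ["A", "A", "B"], ["B"]]
print(count_patterns(packets, {"A", "AB", "AAB", "ABB"}))
# {'A': 1, 'AB': 2, 'AAB': 1}
```

Because the IO is this well defined, checking the output (or having the LLM write the unit test) is trivial.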
Treating it like a thinking partner is good. When programming, I don't treat it like a person, but rather as a very high-level programming language. Imagine exactly how you want the code to be, then find a way to express that unambiguously in natural language; the idea is that you'll still have a bit of work to do writing things out, but it will be a lot quicker than typing out all the code by hand. Combine that with iterations of feedback, having the AI build and run your program (at least as a sanity check), and asking the AI to check the program's behaviour the same way you would, and it gets you quite far.
A limitation is the lack of memory. If you steer it from style A to style B with multiple rounds of feedback and none of that is written down, you'll have to re-explain it all over again in the next AI session.
Deepseek is about 1TB in weights; maybe that is why LLMs don't remember things across sessions yet. I think everybody could have their own personal AI (hosted remotely unless you own lots of compute); it should remember what happened yesterday, in particular the feedback it was given during development. As an AI layman, I do think this is the next step.
You don't trust the code coming out of the probabilistic machine. You build a validation cage around it with hard interfaces and you also review the output.
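A minimal sketch of what such a "validation cage" can look like. Everything here is an assumed example: the inner function stands in for LLM-generated code, and the outer wrapper is the hard interface you actually trust.

```python
# Minimal sketch of a validation cage. The inner function stands in for
# LLM-generated code; the outer wrapper enforces the contract regardless.
def untrusted_parse_percentage(text):
    # Pretend this body was generated by an LLM and merely reviewed.
    return float(text.strip().rstrip("%"))

def parse_percentage(text):
    """Hard interface: always returns a float in [0, 100] or raises."""
    value = untrusted_parse_percentage(text)
    if not isinstance(value, float) or not 0.0 <= value <= 100.0:
        raise ValueError(f"out-of-contract result: {value!r}")
    return value

print(parse_percentage("42%"))  # 42.0
```

The point is that the cage holds no matter how the inner code was produced; review catches bugs, and the interface catches out-of-contract results at runtime.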
> Like most skeptics and critics, I use these tools daily. And 50% of the time they work 50% of the time.
I use LLMs nearly every day for my job as of about a year ago and they solve my issues about 90% of the time. I have a very hard time deciphering if these types of complaints about AI/LLMs should be taken seriously, or written off as irrational use patterns by some users. For example, I have never fed an LLM a codebase and expected it to work magic. I ask direct, specific questions at the edge of my understanding (not beyond it) and apply the solutions in a deliberate and testable manner.
If you're taking a different approach and complaining about LLMs, I'm inclined to think you're doing it wrong. And missing out on the actual magic, which is small, useful, and fairly consistent.
Hmm. OK, so you're basically quoting the line from Anchorman: "60% of the time, it works every time."
I also use GPT and Claude daily via Cursor.
GPT o3 is kinda good for general-knowledge searches. Claude falls down all the time, but I've noticed that while it's spending tokens to jerk itself off, quite often it happens upon the actual issue without recognizing it.
Models are dumb, more idiot than idiot savant, but sometimes they hit on relevant items. As long as you personally have an idea of what needs to happen and treat LLMs like rat terriers in a farm field, you can utilize them properly.
I just went through the last 10 chat titles and all of them were spot on for me. Maybe the person you’re responding to has a different experience than you do and calling their perspective “suspect” is somewhat uncharitable.
(There are times I do other kinds of work and it fails terribly. My main point stands.)
It either helps me find a solution or it doesn't. About 90% of the time, or less formally I would just say "almost all of the time", it does. Keep in mind that I, the user, decide which questions to ask in the first place. If my batting average seems unbelievably high, perhaps my skill is in knowing when to use an LLM and when not to.
This reads like the author is mad about imprecision in the discourse, which is real, but to be quite frank is more rampant amongst detractors than promoters, who often have to deal with the flaws and limitations on a day-to-day basis.
The conclusion that everything around LLMs is magical thinking seems fairly hubristic to me, given that in the last 5 years a set of previously borderline-intractable problems have become completely or near-completely solved: translation, transcription, and code generation (up to some scale), for instance.
> but to be quite frank more rampant amongst detractors than promoters, who often have to deal with the flaws and limitations on a day to day basis.
"detractors" usually point to actual flaws. "promoters" usually uncritically hail LLMs as miracles capable of solving any problem in one go, without giving any specific details.
Google Translate just spits out nonsense for distant language pairs (English<->Korean etc.) and doesn't compare to SOTA LLMs; Whisper is a Transformer (the architecture used for LLMs); and classical code generators have nothing on LLMs.
Crypto is a lifeline for me, as I cannot open a bank account in the country I live in, for reasons I can neither control nor fix. So I am happy if crypto is useless for you. For me and for millions like me, it is a matter of life and death.
As for LLMs — once again, magic for some, a reliable deterministic instrument for others (and also magic). I just classified and sorted a few hundred invoices. Yes, magic.
This is basically the only use case for crypto, and one for which it was explicitly designed: censorship resistance. This is why people have so much trouble finding useful things for it to do in the legal economy, it was explicitly designed to facilitate transactions the government doesn't want or can't facilitate. In some cases, there are humanitarian applications, there are also a lot of illicit applications.
I am a Russian immigrant in Switzerland. As of right now, all Swiss banks block all Russian bank accounts until their owners can provide a valid physical residence permit card, due to sweeping sanctions (meanwhile, Russian-owned companies continue to freely trade crude oil from here, as they use Swiss nominal directors — the hypocrisy is through the roof). My residence permit is on renewal now, and the case has been dragging on for 7 months already — so, no bank account.
I don't think you actually disagree with the author's quip. You seem to want to use crypto as a currency, while the OP was most likely referring to the grifting around crypto as an investment. If you're using it as a currency, then the people trying to pump and dump coins and use it as a money-making vehicle are your adversaries. You are best served if it's stable instead of a rollercoaster of booms and busts.
Stablecoins are a thing. But yes, I hate the current state of affairs, with "memecoins" and whatnot. Particularly the government push from one particular country. We created crypto to be independent from governments, not to enable them.
Said this in another thread and I'll repeat it here:
It's the same problem that crypto experiences. Almost everyone is propagating lies about the technology, even if a majority of those doing so don't understand enough to realize they're lies (naivety vs malice).
I'd argue there's more intentional lying in crypto and less value to be gained, but in both cases people who might derive real benefit from the hard truth of the matter are turning away before they enter the door due to dishonesty/misrepresentation- and in both cases there are examples of people deriving real value today.
> I'd argue there's more intentional lying in crypto
I disagree. Crypto sounds more like intentional lying because it's primarily hyped in contexts typical for scams/gambling. Yes, there are businesses involved (anybody can start one), but they're mostly new businesses or a tiny tack-on to an existing business.
AI is largely being hyped within the existing major corporate structures, therefore its lies just get tagged as "business as usual". That doesn't make them any less of a lie though.
Loosely related, but I find the use of AGI (and sometimes even AI) as terms annoying lately, especially in scientific papers, where I would expect everything to be well defined, at least in how it is used in that paper.
So, why can't we just come up with some definition for what AGI is? We could then, say, logically prove that some AI fits that definition. Even if this doesn't seem practically useful, it's theoretically much more useful than just using that term with no meaning.
Instead it kind of feels like it's an escape hatch. On Wikipedia we have "a type of AI that would match or surpass human capabilities across virtually all cognitive tasks". How could we measure that? What good is this if we can't prove that a system has this property?
Bit of a rant but I hope it's somewhat legible still.
You don't need consensus on the meaning across the board. I maintain my own, more permissive milestone for what constitutes "AGI", but I have no expectations that others will share it. Much like "crypto" to me is still cryptography, not cryptocurrency - sometimes the mainstream will just have a different opinion.
My point is, it's not about the mainstream or marketing. In science, some rigor is expected. There doesn't need to be consensus if a definition is established within some context. It's perfectly fine to redefine something as needed for the research but only if that definition is declared.
Once I decide I want to "think a problem through with an LLM", I often start with just the voice mode. This forces me to say things out loud — which is remarkably effective (see: rubber-duck debugging) — and it also gives me a fundamentally different way of consuming the information the LLM provides. Instead of being delivered a massive amount of text, where some information could be wrong, I get a sequential system where I can stop, pause, or redirect the LLM as soon as something makes me curious or I find problems with what it said.
You would think that having this way of interacting would be limiting, as having a fast LLM output large chunks of information would let you skim through it and commit it to memory faster. Yet, for me, the combination of hearing things and, most of all, not having to consume so much potentially wrong info (what good is it to skim pointless stuff), ensures that ChatGPT's Advanced Voice mode is a great way to initially approach a problem.
After the first round with the voice mode is done, I often move to written-form brainstorming.
So far, most of the time, my impression was "I would have been so badly misled and wouldn't even have known it until too late". It would have saved me some negative time.
The only thing LLMs can consistently help me with so far is typing out mindless boilerplate, and yet it still sometimes requires manual fixing (but I do admit that it still does save effort). Anything else is hit or miss. The kind of stuff it does help researching with is usually the stuff that's easy to research without it anyway. It can sometimes shine with a gold nugget among all the mud it produces, but it's rare. The best thing is being able to describe something and ask what it's called, so you can then search for it in traditional ways.
That said, search engines have gotten significantly worse for research in the last decade or so, so the bar is lower for LLMs to be useful.
Similar situation at my work, but all of the productivity claims from internal early adopters I've seen so far are based on very narrow ways of measuring productivity, and very sketchy math, to put it mildly.
That may also be in part because LLMs are not as big of an accelerant for junior devs as they are for seniors (juniors don't know what is good and bad as well).
So if you give 1 senior dev a souped-up LLM workflow, I wouldn't be too surprised if they are as productive as 10 pre-LLM juniors. Maybe even more, because a bad dev can actually produce negative productivity (stealing time from the senior), in which case the multiplier is infinite.
Even a decent junior is mostly limited to doing the low-level grunt work, which LLMs can already do better.
Point is, I can see how jobs could be lost, legitimately.
Precision machining is going through an absolute nightmare where the journeymen or master machinists are aging out of the work force. These were people who originally learned on manual machines, and upgraded to CNC over the years. The pipeline collapsed about 1997.
Now there are no apprentice machinists to replace the skills of the retiring workforce.
This will happen to software developers. Probably faster because they tend to be financially independent WAY sooner than machinists.
It is something to sneeze at if you are 10-15% more expensive to employ due to the cost of the LLM tools. The total cost of production should always be considered, not just throughput.
Claude Max is $200/month, or ~2% of the salary of an average software engineer.
How is one spending anywhere close to 10% of total compensation on LLMs?
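A back-of-envelope check of the ~2% figure above, with the salary as an assumed average rather than a sourced number:

```python
# Quick sanity check of the ~2% claim; the salary is an assumed average.
monthly_tool_cost = 200        # Claude Max, USD per month
annual_salary = 120_000        # assumed average SWE salary, USD per year

share = (monthly_tool_cost * 12) / annual_salary
print(f"{share:.0%}")  # 2%
```

Even at $500/month of tooling, the share stays around 5%, nowhere near 10% of total compensation.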
The AI thing kind of reminds me of the big push to outsource software engineers in the early 2000's. There was a ton of hype among executives about it, and it all seemed plausible on paper. But most of those initiatives ended up being huge failures, and nearly all of those jobs came back to the US.
People tend to ignore a lot of the little things that glue it all together that software engineers do. AI lacks a lot of this. Foreigners don't necessarily lack it, but language barriers, time zone differences, cultural differences, and all sorts of other things led to similar issues. Code quality and maintainability took a nosedive and a lot of the stuff produced by those outsourced shops had to be thrown in the trash.
I can already see the AI slop accumulating in the codebases I work in. It's super hard to spot a lot of these things that manage to slip through code review, because they tend to look reasonable when you're looking at a diff. The problem is all the redundant code that you're not seeing, and the weird abstractions that make no sense at all when you look at it from a higher level.
Management thinks the LLM is doing most of the work. Work is off shored. Oh, the quality sucks when someone without a clue is driving. We need to hire again.
Frankly, even just getting engineers to agree upon those super-specific standardized patterns is asking a ton, especially since a lot of the things that help AI out are not what they are used to. As soon as you have stuff that starts deviating, it can confuse the AI and make that 10x no longer accessible. Also, no one would want to review the PRs I'd make for the changes I do on my "10x" local project. Maintaining those standards is already hard enough on my side projects; AI will naturally deviate and create noise, and the challenge is constructing systems to guide it and make sure nothing deviates (since noise would lead to more noise).
I think it's mostly a rebalancing thing: if you have 1 or a couple of like-minded engineers who intend to do it, they can get that 10x. I do not see that EVER existing in any actual corporate environment, or even once you get more than like 4 people, tbh.
AI for middle management and project planning, on the other hand...
You had to paste more into your prompts back then to make the output work with the rest of your codebase, because there weren't good IDEs/"agents" for it, but you've been able to get really, really good code for 90% of "most" day-to-day SWE work since at least OpenAI releasing the GPT-4 API, which was a couple of years ago.
Today it's a lot easier to demo low-effort "make a whole new feature or prototype" things than doing the work to make the right API calls back then, but most day to day work isn't "one shot a new prototype web app" and probably won't ever be.
I'm personally more productive than 1 or 2 years ago now because the time required to build the prompts was slower than my personal rate of writing code for a lot of things in my domain, but hardly 10x. It usually one-shots stuff wrong, and then there's a good chance that it'll take longer to chase down the errors than it would've to just write the thing - or only use it as "better autocomplete" - in the first place.
So? It sounds like you're prodding us to make an extrapolation fallacy (I don't even grant the "10x in 12 months" point, but let's just accept the premise for the sake of argument).
Honestly, 12 months ago the base models weren't substantially worse than they are right now. Some people will argue with me endlessly on this point, and maybe they're a bit better on the margin, but I think it's pretty much true. When I look at the improvements of the last year with a cold, rational eye, they've been in two major areas:
So how do we improve from here? Cost and efficiency are the obvious lever with historical precedent: GPUs kinda suck for inference, and costs are (currently) rapidly dropping. But maybe this won't continue -- algorithmic complexity is what it is, and barring some revolutionary change in the architecture, LLMs are exponential algorithms.

UI and integration is where most of the rest of the recent improvement has come from, and honestly, this is pretty close to saturation. All of the various AI products already look the same, and I'm certain that they'll continue to converge to a well-accepted local maximum. After that, huge gains in productivity from UX alone will not be possible. This will happen quickly -- probably in the next year or two.
Basically, unless we see a Moore's law of GPUs, I wouldn't bet on indefinite exponential improvement in AI. My bet is that, from here out, this looks like the adoption curve of any prior technology shift (e.g. mainframe -> PC, PC -> laptop, mobile, etc.) where there's a big boom, then a long, slow adoption for the masses.
Your developers still push a mouse around to get work done? Fire them.
AI is the new uplift. Embrace and adapt, as a rift is forming in what employers seek in terms of skills from employees (see my talk at https://ghuntley.com/six-month-recap/).
I'm happy to answer any questions folks may have. Currently AFK [2] vibecoding a brand new programming language [1].
[1] https://x.com/GeoffreyHuntley/status/1940964118565212606 [2] https://youtu.be/e7i4JEi_8sk?t=29722
That would be a 70% descent?
Not necessarily because of their attitude but because it turns out the software they were shipping was rife with security issues. Security managed to quickly detect and handle the resulting incident. I can't say his team were sad to see him go.
We should not be having to code special 'host is Ableton Live' cases in JUCE just to get your host to work like the others.
Can you please not fire any people who are still holding your operation together?
Everyone else who raises any doubts about LLMs is an idiot and you're 10,000x better than everyone else and all your co-workers should be fired.
But what's absent from all your comments is what you make. Can you tell us what you actually do in your >500k job?
Are you, by any chance, a front-end developer?
Also, a team-lead that can't fire their subordinates isn't a team-lead, they're a number two.
isn't this the entire LLM experience?
I started a job at a demanding startup, and several months in I have still not written a single line of code by hand. I audit everything myself before making PRs and test rigorously, but Cursor + Sonnet is just insane with their codebase. I'm convinced I'm their most productive employee, and that's not by measuring lines of code, which don't matter; people who are experts in the codebase ask me for help with niche bugs I can narrow in on in 5-30 minutes as someone who's fresh to their domain. I had to stop taking work away from the front-end dev (front end being something I've avoided my whole career) because I was stepping on his toes, fixing little problems as I saw them thanks to Claude. It's not vibe coding - there's a process of research and planning and careful review in measured steps, and I set the agent up for success. Domain knowledge is necessary. But I'm just so floored how anyone could not be extracting the same utility from it. It feels like there are two articles like this every week now.
You didn't share any evidence with us even though you claim unbelievable things.
You even went as far as registering a throwaway account to hide your identity and make verifying any of your claims impossible.
Your comment feels more like a joke to me.
Look, the person who wrote that comment doesn't need to prove anything to you just because you're hopped up after reading a blog post that has clearly given you a temporary dopamine bump.
People who understand their domains well and are excellent written communicators can craft prompts that will do what we used to spend a week spinning up. It's self-evident to anyone in that situation, and the only thing we see when people demand "evidence" is that you aren't using the tools properly.
We don't need to prove anything because if you are working on interesting problems, even the most skeptical person will prove it to themselves in a few hours.
It doesn't take away the requirements of _curation_ - that remains firmly in my camp (partially what a PhD is supposed to teach you! to be precise and reflective about why you are doing X, what do you hope to show with Y, etc -- breakdown every single step, explain those steps to someone else -- this is a tremendous soft skill, and it's even more important now because these agents do not have persistent world models / immediately forget the goal of a sequence of interactions, even with clever compaction).
If I'm on my game with precise communication, I can use CC to organize computation in a way which has never been possible before.
It's not easier than programming (if you care about quality!), but it is different, and it comes with different idioms.
Someone told me "AI makes all the little things trivial to do", and I agree strongly with that. Those many little things together make a strong statement about quality. Our codebase has gone up in quality significantly with AI, whereas we'd let the little things slide due to understaffing before.
That was my experience with Cursor, but Claude Code is a different world. What specific product/models brought you to this generalization?
How do you audit code from an untrusted source that quickly? LLMs do not have the whole project in their heads and are prone to hallucinating.
On average how long are your prompts and does the LLM also write the unit tests?
I personally think you're sugar-coating the experience.
The person you're responding to literally said, "I audit everything myself before making PRs and test rigorously".
Your specific experience cannot be generalized. And I'm speaking as the author, who is (as written in the article) literally using these tools every day.
> But I’m just so floored how anyone could not be extracting the same utility from it. It feels like there’s two articles like this every week now.
This is where we learn that you haven't actually read the article. Because it is very clearly stating, with links, that I am extracting value from these tools.
And the article is also very clearly not about extracting or not extracting value.
Damn, this sounds pretty boring.
Links please
This is _far_ from web crud.
Otherwise, 99% of my code these days is LLM generated; there are a fair number of visible commits from my open-source work on my profile https://github.com/wesen .
A lot of it is more on the system side of things, although there are a fair amount of one-off webapps, now that I can do frontends that don't suck.
How do you measure this?
Are there any good articles you can share or maybe your process? I’m really trying to get good at this but I don’t find myself great at using agents and I honestly don’t know where to start. I’ve tried the memory bank in cline, tried using more thinking directives, but I find I can’t get it to do complex things and it ends up being a time sink for me.
A bit suspicious, wouldn’t you agree?
No agenda here, not selling anything. Just sitting here towards the later part of my career, no need to prove anything to anyone, stating the view from a grey beard.
Crypto hype was shilled by grifters pumping whatever bag-holding scam they could, which was precisely what the behavioral-economic incentives drove. GenAI dev is something else. I've watched many people working with it; your mileage will vary. But in my opinion (and it's mine, you do you), hand coding is becoming an anachronistic skill. The only part I wonder about is how far up and down the system/design/architecture stack the power tooling is going to go. My intuition and empirical findings incline towards a direction I think would fuel a flame war. But I'm just a grey-beard Internet random, and hey look, no evidence, just more baseless claims. Nothing to see here.
Disclosure: I hold no direct shares in Mag 7, nor do I work for one.
_So much_ work in the 'services' industries globally comes down to really a human transposing data from one Excel sheet to another (or from a CRM/emails to Excel), manually. Every (or nearly every) enterprise scale company will have hundreds if not thousands of FTEs doing this kind of work day in day out - often with a lot of it outsourced. I would guess that for every 1 software engineer there are 100 people doing this kind of 'manual data pipelining'.
So really, for giant value to be created out of LLMs you do not need them to be incredible at OCaml. They just need to ~outperform humans at Excel. Where I do think MCP really helps is that you can connect all these systems together easily; a lot of the errors in this kind of work came from trying to pass the entire "task" in context. If you can take an email via MCP, extract some data out, and put it into a CRM (again via MCP) a row at a time, the hallucination rate is very low IME. I would say at least on par with an overworked junior human.
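To make the "row at a time" idea concrete, here is a hedged sketch. The field, regex, and function names are all illustrative; in practice the extraction step would be the LLM/MCP call, with the validation around it doing the quality enforcement.

```python
# Illustrative sketch of row-at-a-time extraction with validation.
# In practice the extract step would be an LLM call via MCP; a regex
# stands in here so the validation logic around it stays visible.
import re

def extract_invoice_total(email_body):
    """Pull a single 'Total: $1,234.56'-style amount, or None if absent."""
    match = re.search(r"Total:\s*\$([\d,]+\.\d{2})", email_body)
    return float(match.group(1).replace(",", "")) if match else None

def append_crm_row(crm_rows, email_body):
    # One row per call: a failed extraction stops here instead of
    # corrupting a whole batch, and can be routed to a human.
    total = extract_invoice_total(email_body)
    if total is None or total < 0:
        raise ValueError("extraction failed; route to a human reviewer")
    crm_rows.append({"invoice_total": total})

rows = []
append_crm_row(rows, "Hi team, invoice attached. Total: $1,234.56 due Friday.")
print(rows)  # [{'invoice_total': 1234.56}]
```

Processing one row per call keeps each context small, which is exactly why the hallucination rate drops compared to passing the entire task at once.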
Perhaps this was the point of the article, but non-determinism is not an issue for these kind of use cases, given all the humans involved are not deterministic either. We can build systems and processes to help enforce quality on non deterministic (eg: human) systems.
Finally, I've followed crypto closely and also LLMs closely. They do not seem to be similar in terms of utility and adoption. The closest thing I can recall is smartphone adoption. A lot of my non technical friends didn't think/want a smartphone when the iPhone first came out. Within a few years, all of them have them. Similar with LLMs. Virtually all of my non technical friends use it now for incredibly varied use cases.
That said, the social response is a trend of tech worship that I suspect many engineers who have been around the block are weary of. It’s easy to find unrealistic claims, the worst coming from the CEOs of AI companies.
At the same time, a LOT of people are practically computer illiterate. I can only imagine how exciting it must seem to people who have very limited exposure to even basic automation. And the whole “talking computer” we’ve all become accustomed to seeing in science fiction is pretty much becoming reality.
There’s a world of takes in there. It’s wild.
I worked in ML and NLP several years before AI. What’s most striking to me is that this is way more mainstream than anything that has ever happened in the field. And with that comes a lot of inexperience in designing with statistical inference. It’s going to be the Wild West for a while — in opinions, in successful implementation, in learning how to form realistic project ideas.
Look at it this way: now your friend with a novel app idea can be told to do it themselves. That’s at least a win for everyone.
For now, anyways. Thing is, that friend now also has a reasonable shot at succeeding in doing it themselves. It'll take some more time for people to fully internalize it. But let's not forget that there's a chunk of this industry that's basically building apps for people with "novel app ideas" that have some money but run out of friends to pester. LLMs are going to eat a chunk out of that business quite soon.
ultimately, crypto is information science. mathematically, cryptography, compression, and so on (data transmission) are all the "same" problem.
LLMs compress knowledge, not just data, and they do it in a lossy way.
traditional information science work is all about dealing with lossless data in a highly lossy world.
An LLM won't call other nodes in the organization to check when it sees that the value is unreasonable for some out-of-context reason, like yesterday was a one-time-only bank holiday and so the value should be 0. *It can absolutely be worth an FTE salary to make sure these numbers are accurate.* And for there to be a person to blame/fire/imprison if they aren't accurate.
There is also a reason these jobs haven't been automated already. Many of them don't need language models; we could have automated them long ago, but it was never worth it for someone to sign off on. I have been in this situation at a bank. I could have automated a process rather easily, but the upside for me was a smaller team and no real gain, while the downside was getting fired for a massive automated mistake if something went wrong.
Why not? LLMs are the first kind of technology that can take this kind of global view. We're not making much use of it in this way just yet, but considering "out-of-context reasons" and taking a wider perspective is pretty much the defining aspect of LLMs as general-purpose AI tools. In time, I expect them to match humans on this (at least humans that care; it's not hard to match those who don't).
I do agree on the liability angle. This increasingly seems to be the main value a human brings to the table. It's not a new trend, though. See e.g. medicine, architecture, civil engineering - licensed professionals aren't doing the bulk of the work, but they're in the loop and well-compensated for verifying and signing off on the work done by less-paid technicians.
For what type of company is this true? I really would like someone to just do a census of 500 white-collar jobs and categorize them all. Anything that is truly automatic has already been automated away.
I do think AI will cause a lot of disruption, but very skeptical of the view that most people with white collar jobs are just "email jobs" or data entry. That doesn't fit my experience at all, and I've worked at some large bureaucratic companies that people here would claim are stuck in the past.
My comment is mainly to say LLMs are amazing in areas that are not coding, like brainstorming, blue sky thinking, filling in research details, asking questions that make me reflect. I treat the LLM like a thinking partner. It does make mistakes, but those can be caught easily by checking other sources, or even having another LLM review the conclusions.
I built something in less than 24h that I'm sure would have taken us MONTHS to just get off the ground, let alone reach the polished version it's at right now. Impressive enough is that it can do all of the things I absolutely can do, just faster. But the most impressive thing is that it can do all the things I cannot possibly do and would have had to hire or contract out to accomplish, for far less money and time, and with faster iterations than if I had to communicate with another human being.
It's not perfect and it's incredibly frustrating at times (hardcoding values into the code when I have explicitly told it not to; outright lying that it made a particular fix, when it actually changed something else entirely unrelated), but it is a game changer IMO.
Would love to see it!
Of course, I was playing around with Claude Code too, and I was fascinated by how fun it can be, and yes, you can get stuff done. But I have absolutely no clue what the code is doing and whether there are some nasty mistakes. So it kinda worked, but I would not use that for anything "mission critical" (whatever that means).
I think it's dangerously easy to get misled when trying to prod LLMs for knowledge, especially if it's a field you're new to. If you were using a regular search engine, you could look at the source website to determine the trustworthiness of its contents, but LLMs don't have that. The output can really be whatever, and I don't agree it's necessarily that easy to catch the mistakes.
That said, don't use model output directly. Use it to extract "shibboleth" keywords and acronyms in that domain, then search those up yourself with a classical search engine (or in a follow-up LLM query). You'll access a lot of new information that way, simply because you know how to surface it now.
All code, including stuff that we experienced coders write is inherently probabilistic. That’s why we have code reviews, unit tests, pair programming, guidelines and guardrails in any critical project. If you’re using LLM output uncritically, you’re doing it wrong, but if you’re using _human_ output uncritically you’re doing it wrong too.
That said, they are not magic, and my fear is that people use copilots and agentic models and all the rest to hide poor engineering practice, building more and more boilerplate instead of refactoring or redesigning for efficiency or safety or any of the things that matter in the long run.
The problem is that the LLM needs context for what you are doing, context that you won't (or are too lazy to) give in a chat with it à la ChatGPT. This is where Claude Code changes the game.
For example, say you have a PCAP file where each UDP packet contains multiple messages.
How do you filter on IP/port/protocol/time? Use LLM, check the output
How do you find the number of packets that match patterns A, AB, AAB, ABB...? Use LLM, check the output
How do you create PCAPs that only contain those packets for testing? Use LLM, check the output
Etc etc
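The first task above (filter UDP packets by port) is the kind of thing an LLM typically generates, and also the kind of output you can check. A minimal stdlib-only sketch, assuming a classic pcap with Ethernet II + IPv4 framing and no VLAN tags (real captures would more likely use scapy or dpkt); the synthetic capture here exists only so the filter can be exercised end to end:

```python
# Filter UDP packets in a classic pcap by destination port, stdlib only.
# Offsets assume Ethernet II + IPv4; checksums are left at 0 since this
# is a parsing demo, not a packet generator for the wire.
import struct

# pcap global header: magic, v2.4, tz 0, sigfigs 0, snaplen 65535, linktype 1 (Ethernet)
PCAP_GLOBAL_HDR = struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)

def make_udp_packet(src_port, dst_port, payload):
    """Build a bare-bones Ethernet/IPv4/UDP frame."""
    udp = struct.pack("!HHHH", src_port, dst_port, 8 + len(payload), 0) + payload
    ip = struct.pack("!BBHHHBBH4s4s",
                     0x45, 0, 20 + len(udp), 0, 0, 64, 17, 0,  # proto 17 = UDP
                     bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]))
    eth = b"\x00" * 12 + struct.pack("!H", 0x0800)  # EtherType IPv4
    return eth + ip + udp

def pcap_bytes(packets):
    """Serialize (timestamp, frame) pairs into a pcap byte string."""
    out = bytearray(PCAP_GLOBAL_HDR)
    for ts, frame in packets:
        out += struct.pack("<IIII", ts, 0, len(frame), len(frame)) + frame
    return bytes(out)

def filter_udp_dst_port(pcap, port):
    """Yield (timestamp, frame) for UDP packets addressed to `port`."""
    off = 24  # skip the global header
    while off < len(pcap):
        ts, _, incl, _ = struct.unpack_from("<IIII", pcap, off)
        frame = pcap[off + 16 : off + 16 + incl]
        off += 16 + incl
        if frame[12:14] != b"\x08\x00":   # not IPv4
            continue
        ihl = (frame[14] & 0x0F) * 4      # IP header length in bytes
        if frame[23] != 17:               # IP protocol byte: not UDP
            continue
        dst = struct.unpack_from("!H", frame, 14 + ihl + 2)[0]
        if dst == port:
            yield ts, frame

cap = pcap_bytes([(100, make_udp_packet(5000, 53, b"A")),
                  (101, make_udp_packet(5000, 8080, b"B")),
                  (102, make_udp_packet(5001, 53, b"AB"))])
matches = list(filter_udp_dst_port(cap, 53))
print(len(matches))  # 2
```

This is exactly the "use LLM, check the output" loop: the filter is a page of code you'd never want to hand-write for a one-off, but verifying it against a few known packets takes minutes.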
Since it can read your code, it is able to infer what you are trying to do at a much better rate (because let's be honest, your work ain't special). In any case, the fact that you can simply ask "Please write a unit test for all of the above functions" means that you can help it verify itself.
A limitation is the lack of memory. If you steer it from style A to style B over multiple rounds of feedback and none of it is written down, you'll have to re-explain it all in the next session.
DeepSeek is about 1 TB of weights; maybe that is why LLMs don't remember things across sessions yet. I think everybody could have their own personal AI (hosted remotely unless you own lots of compute); it should remember what happened yesterday, in particular the feedback it was given during development. As an AI layman, I do think this is the next step.
I use LLMs nearly every day for my job as of about a year ago and they solve my issues about 90% of the time. I have a very hard time deciphering if these types of complaints about AI/LLMs should be taken seriously, or written off as irrational use patterns by some users. For example, I have never fed an LLM a codebase and expected it to work magic. I ask direct, specific questions at the edge of my understanding (not beyond it) and apply the solutions in a deliberate and testable manner.
If you're taking a different approach and complaining about LLMs, I'm inclined to think you're doing it wrong, and missing out on the actual magic, which is small, useful, and fairly consistent.
I also use GPT and Claude daily via Cursor.
GPT o3 is kinda good for general knowledge searches. Claude falls down all the time, but I've noticed that while it's spending tokens to jerk itself off, it quite often happens upon the actual issue without recognizing it.
Models are dumb and more idiot than idiot savant, but sometimes they hit on relevant items. As long as you personally have an idea of what you need to happen and treat LLMs like rat terriers in a farm field, you can utilize them properly
"90%" also seems a bit suspect.
(There are times I do other kinds of work and it fails terribly. My main point stands.)
The conclusion that everything around LLMs is magical thinking seems fairly hubristic to me, given that in the last 5 years a set of previously borderline-intractable problems has become completely or near-completely solved: translation, transcription, and code generation (up to some scale), for instance.
"detractors" usually point to actual flaws. "promoters" usually uncritically hail LLMs as miracles capable of solving any problem in one go, without giving any specific details.
Google Translate, Whisper and Code Generators (up to some scale) have existed for quite some time without using LLMs.
Crypto is a lifeline for me, as I cannot open a bank account in the country I live in, for reasons I can neither control nor fix. So I am happy if crypto is useless for you. For me and for millions like me, it is a matter of life and death.
As for LLMs — once again, magic for some, a reliable deterministic instrument for others (and also magic). I just classified and sorted a few hundred invoices. Yes, magic.
"You had to be there to believe it" https://x.com/0xbags/status/1940774543553146956
The AI craze is currently going through a similar period: any criticism is brushed away as coming from morons who know nothing.
It's the same problem that crypto experiences. Almost everyone is propagating lies about the technology, even if a majority of those doing so don't understand enough to realize they're lies (naivety vs malice).
I'd argue there's more intentional lying in crypto and less value to be gained, but in both cases people who might derive real benefit from the hard truth of the matter are turning away before they enter the door due to dishonesty/misrepresentation, and in both cases there are examples of people deriving real value today.
I disagree. Crypto sounds more like intentional lying because it's primarily hyped in contexts typical for scams/gambling. Yes, there are businesses involved (anybody can start one), but they're mostly new businesses or a tiny tack-on to an existing business.
AI is largely being hyped within the existing major corporate structures, so its lies just get tagged as "business as usual". That doesn't make them any less of a lie, though.
So, why can't we just come up with some definition for what AGI is? We could then, say, logically prove that some AI fits that definition. Even if this doesn't seem practically useful, it's theoretically much more useful than just using that term with no meaning.
Instead it kind of feels like an escape hatch. On Wikipedia we have "a type of AI that would match or surpass human capabilities across virtually all cognitive tasks". How could we measure that? What good is it if we can't prove that a system has this property?
Bit of a rant but I hope it's somewhat legible still.
"AI is whatever hasn't been done yet."[1]
1. https://en.wikipedia.org/wiki/AI_effect