About "people still thinking LLMs are quite useless", I still believe that the problem is that most people are exposed to ChatGPT 4o that at this point for my use case (programming / design partner) is basically a useless toy. And I guess that in tech many folks try LLMs for the same use cases. Try Claude Sonnet 3.5 (not Haiku!) and tell me if, while still flawed, is not helpful.
But there is more: a key thing with LLMs is that their ability to help, as a tool, changes vastly based on your communication ability. The prompt is king: it makes those models 10x better than they are with a lazy one-liner question. Drop your files in the context window; ask very precise questions explaining the background. They work great to explore what is at the borders of your knowledge. They are also great at doing boring tasks for which you can provide perfect guidance (but that still would take you hours). The best LLMs (in my case just Claude Sonnet 3.5, I must admit) out there are able to accelerate you.
I'm surprised at the description that it's "useless" as a programming / design partner. Even if it doesn't make "elegant" code (whatever that means), it's the difference between an app existing at all, or not.
I built and shipped a Swift app to the App Store, currently generating $10,200 in MRR, exclusively using LLMs.
I wouldn't describe myself as a programmer, and didn't plan to ever build an app, mostly because in the attempts I made, I'd get stuck and couldn't google my way out.
LLMs are the great un-stickers. For that reason alone, they are incredibly useful.
The context here is super-important - the commenter is the author of Redis. So, a super-experienced and productive low-level programmer. It’s not surprising that Staff-plus experts find LLMs much less useful.
Though I’d be interested if this was an opinion on “help me write this gnarly C algorithm” or “help me to be productive in <new language>” as I find a big productivity increase from the latter.
Off topic, but I'm a bit confused. Your iOS apps as listed on your website are CarPrep and Brocly, neither of which appear to have notable review activity or buzz in the media. If the app you're referring to is one of these, the more interesting question (to me) is: how on Earth are you generating $10,200 MRR from it? Or is there another app that I'm missing?
(In my experience as an app developer, getting any traction and/or money from your app can be much more difficult than actually building it.)
> I built and shipped a Swift app to the App Store, currently generating $10,200 in MRR, exclusively using LLMs.
My experience is that people who claim they build worthwhile software "exclusively" using LLMs are lying. I don't know you and I don't know if you are lying, but I would be willing to bet my paycheck you are.
I interpreted it as saying that YMMV with respect to the models you try and how you use them, and that sole exposure to one that doesn't work for you can put you off the whole lot. In this case antirez finds Claude Sonnet (with good prompting) very helpful, but GPT-4o (by far the best known, due to ChatGPT) not so much; if the latter is representative of others' experience, it may be why many are still sceptical.
I tried exactly that, a simple Todo-like app, without SwiftUI or Swift knowledge, and Sonnet 3.5 only gave me one syntax error after another. Now I'm watching Paul Hudson's intro videos.
I think a lot of the confusion is in how we approach LLMs. Perhaps stemming from the over-broad term “AI”.
There are certain classes of problems that LLMs are good at. Accurately regurgitating all accumulated world knowledge ever is not one, so don’t ask a language model to diagnose your medical condition or choose a political candidate.
But do ask them to perform suitable tasks for a language model! Every day, by automation, I feed the hourly weather forecast to my home ollama server and it builds me a nice, readable, concise weather report. It's super cool!
There are lots of cases like this where you can give an LLM reliable data and ask it to do a language related task and it will do an excellent job of it.
If nothing else it’s an extremely useful computer-human interface.
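Purely as illustration, that kind of automation can be sketched against ollama's local REST API. The endpoint and request shape below are ollama's defaults, but the model name and the forecast format are placeholders; the forecast string would come from whatever weather API you use.

```python
# Sketch of the weather-report automation described above, assuming a
# local ollama server on its default port and a model already pulled.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_prompt(hourly_forecast: str) -> str:
    """Wrap the raw hourly data in an instruction for the model."""
    return (
        "Summarize this hourly forecast as a concise, readable "
        "weather report for the day:\n\n" + hourly_forecast
    )

def weather_report(hourly_forecast: str, model: str = "llama3") -> str:
    """POST the prompt to the local ollama server and return its reply."""
    payload = json.dumps(
        {"model": model, "prompt": build_prompt(hourly_forecast), "stream": False}
    ).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Wire that into cron (or a launchd job) and pipe the result wherever you read it in the morning.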
> Every day, by automation, I feed the hourly weather forecast to my home ollama server and it builds me a nice, readable, concise weather report.
Not to dissuade you from a thing you find useful, but are you aware that the National Weather Service produces an Area Forecast Discussion product in each local NWS office, daily or more often, that accomplishes this with human meteorologists and a clickable jargon glossary?
> so don’t ask a language model to diagnose your medical condition
(o1-preview) LLMs show promise in clinical reasoning but fall short in probabilistic tasks, underscoring why AI shouldn't replace doctors for diagnosis just yet.
"Superhuman performance of a large language model on the reasoning tasks of a physician" https://arxiv.org/abs/2412.10849 [14 Dec 2024]
I actually found 4o+search to be really good at this... Admittedly what I did was more "research these candidates, tell me anything newsworthy, pros/cons, etc" (much longer prompt), and it was way faster and more patient at finding sources than I ever would've been, telling me things I never would've figured out with <5 minutes of googling each set of candidates (which is what I've done before).
Honestly my big rule for what LLMs are good at is stuff like "hard/tedious/annoying to do, easy to verify" and maybe a little more than that. (I think after using a model for a while you can get a "feel" for when it's likely BSing.)
>don’t ask a language model to diagnose your medical condition
Honestly, they are very decent at it if you give them accurate information on which to base the diagnosis. The typical problem people have is being unable to feed accurate information to the model. They'll cut out parts they don't want to think about, or not put full test results in for consideration.
> Every day, by automation, I feed the hourly weather forecast to my home ollama server and it builds me a nice, readable, concise weather report. It's super cool!
You feed it a weather report and it responds with a weather report? How is that useful?
I don't think people finding LLMs useless is a good representation of the general sentiment, though. I feel that, more than anything, people are annoyed at LLM slop. When someone uses an LLM too much to write code, they create "slop," which ends up making things worse.
Unfortunately, complex tools will be misused by part of the population. There is no easy escape from that in the modern world of possibilities. Look at the Internet itself.
> But there is more: a key thing with LLMs is that their ability to help, as a tool, changes vastly based on your communication ability. The prompt is king: it makes those models 10x better than they are with a lazy one-liner question.
People keep saying this, and there are use cases for which this is definitely the case, but I find the opposite to be just as true in some circumstances.
I'm surprised at how good LLMs are at answering "me be monkey, me have big problem with code" questions. For simple one-offs like "how to do x in Pandas" (a frequent one for me), I often just give Claude a mish-mash of keywords, and it usually figures out what I want.
An example prompt of mine from yesterday, which Claude successfully answered, was "python sha256 of file contents base64 safe for fs path."
With a system prompt to make Claude's output super brief and a command to execute queries from the terminal via Simon Willison's LLM tool, this is extremely useful.
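For the curious, what that keyword-soup prompt is asking for can be spelled out in a few lines (function name is mine, not from the thread):

```python
# "python sha256 of file contents base64 safe for fs path", spelled out:
# hash a file's contents and encode the digest so it can be used as a
# filesystem path component.
import base64
import hashlib
from pathlib import Path

def content_hash_name(path: str) -> str:
    digest = hashlib.sha256(Path(path).read_bytes()).digest()
    # The urlsafe alphabet uses '-' and '_' instead of '+' and '/', so the
    # result contains no path separators; strip the '=' padding, which some
    # tools and filesystems dislike.
    return base64.urlsafe_b64encode(digest).decode().rstrip("=")
```

A 32-byte SHA-256 digest always comes out as 43 characters this way.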
> About "people still thinking LLMs are quite useless", I still believe that the problem is that most people are exposed to ChatGPT 4o that at this point for my use case (programming / design partner) is basically a useless toy....
and
> a key thing with LLMs is that their ability to help, as a tool, changes vastly based on your communication ability.
I still hold that the innovations we've seen as an industry with text will transfer to data from other domains. And there's an odd misbehavior I've now seen play out twice -- back in 2017 with vision models (please don't shove a picture of a spectrogram into an object detector), and today. People are trying to coerce text models to do stuff with data series, or (again!) pictures of charts, rather than paying attention to time-series foundation models, which can work directly on the data.[1]
Further, the tricks we're seeing with encoder / decoder pipelines should work for other domains, and we're not yet recognizing that as an industry. For example, Whisper and the emerging video models are getting there, but think about multi-spectral satellite data, or fraud detection (a type of graph problem).
There's lots of value to unlock from coding models. They're just text models. So what if you were to shove an abstract syntax tree in as the data representation, or the intermediate code from LLVM or a JVM or whatever runtime and interact with that?
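As a purely illustrative sketch of what "shove an AST in" could mean: Python's own ast module can flatten source into a stream of node-type tokens that a sequence model could consume (a real system would keep identifiers and structure, this just shows the idea).

```python
# Flatten Python source into a stream of AST node-type tokens -- one toy
# example of a non-text token vocabulary a model could be trained on.
import ast

def ast_token_stream(source: str) -> list[str]:
    tree = ast.parse(source)
    # ast.walk yields the root first, then descendants breadth-first.
    return [type(node).__name__ for node in ast.walk(tree)]
```

For `"x = 1 + 2"` this yields tokens like `Module`, `Assign`, `BinOp`, a far more structured vocabulary than raw characters.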
> It's a bit sad and confusing that LLMs ("Large Language Models") have little to do with language; It's just historical. They are highly general purpose technology for statistical modeling of token streams. A better name would be Autoregressive Transformers or something.
> They don't care if the tokens happen to represent little text chunks. It could just as well be little image patches, audio chunks, action choices, molecules, or whatever. If you can reduce your problem to that of modeling token streams (for any arbitrary vocabulary of some set of discrete tokens), you can "throw an LLM at it".
The environmental arguments are hilarious to me as a diehard crypto guy. The ultimate answer to “waste” of electricity arguments is that energy is a free market and people pay the price if it’s useful for them. As long as the activity isn’t illegal then training LLMs or mining bitcoins, it doesn’t matter. I pay for the electricity I use.
I'm a big believer in Claude. I've accomplished some huge productivity gains by leveraging it. That said, I can see places where the models are strong and weak. If you're doing React or Python, these models are incredible. With C# and C++ they're not terrible. Rust, though, is not great. If your experience is exclusively trying to use it to write Rust, it doesn't matter if you're using o1, Claude or anything else. It's just not great at it yet.
> Try Claude Sonnet 3.5 (not Haiku!) and tell me if, while still flawed, it is not helpful.
It's not as helpful as Google was ten years ago. It's more helpful than Google today, because Google search has slowly been corrupted by garbage SEO and other LLM spam, including their own suggestions.
Claude Sonnet 3.5 can write whole React applications with proper contextual clues and some minor iterations. Google has never coded for you.
I’ve written two large applications and about a dozen smaller ones using Claude as an assistant.
I’m a terrible front-end developer and almost none of that work was possible without Claude. The API and AWS deployment were sped up tremendously.
I’ve created unit tests and I’ve read through the resulting code and it’s very clean. One of my core pre-prompt requirements has always been to follow domain-driven design principles, something a novice would never understand.
I also start with design principles and a checklist that Claude is excellent at providing.
My only complaint is you only have a 3-4 hour window before you're cut off for a few hours.
And needing an enterprise agreement to have a walled garden for proprietary purposes.
I was not a fan in Q1. Q2 improved. Q3 was a massive leap forward.
> ask very precise questions explaining the background
IME, being forced to write about something, or to verbally explain/enumerate things in detail, _by itself_ leads to a lot of clarity in the writer's thoughts, irrespective of whether there's an LLM answering back.
People have been doing rubber-duck debugging for a long time. The metaphorical duck (LLMs, in our context), if explained to well, has now started answering back with useful stuff!
One thing LLMs have been incredibly strong at, ever since gpt-3.5, is being the most advanced non-human rubber duck, and while they can do plenty more, that alone provides (me, at least) with tremendous utility.
> About "people still thinking LLMs are quite useless", I still believe that the problem is that most people are exposed to ChatGPT 4o that at this point for my use case (programming / design partner) is basically a useless toy. And I guess that in tech many folks try LLMs for the same use cases. Try Claude Sonnet 3.5 (not Haiku!) and tell me if, while still flawed, is not helpful.
I see much deeper problems. Just to give two examples:
- I asked various AIs for explanations of proofs of some deep (established) mathematical theorems: the explanations were, to my understanding, heavily hallucinated, and thus worse than "obviously wrong". I also asked for literature references for some deep mathematical theory frameworks: basically all of the references were again hallucinated.
- I asked lots of AIs on https://lmarena.ai/ to write a suitably long text about a political topic that is quite controversial in my country (but does have lots of proponents, even in a very radical formulation, even though most people would not use such a radical formulation in public). All of the LLMs that I checked refused, or tried to indoctrinate me that this thesis is wrong. I did not ask the LLM to lecture me; I gave it a concrete task! Society is deeply divided, so if the LLM only spreads the propaganda of its political teaching, it will be useless for many tasks for a very significant share of society.
Both new Sonnet and Haiku have a masking overhead.
Using a few messages to get them out of "I aim to be direct" AI assistant mode gets much better overall results for the rest of the chat.
Haiku is actually incredibly good at high level systems thinking. Somehow when they moved to a smaller model the "human-like" parts fell away but the logical parts remained at a similar level.
Like if you were taking meeting notes from a business strategy meeting and wanted insights, use Haiku over Sonnet, and thank me later.
To get the most out of them you have to provide context. Treat these models like some kind of eager beaver junior engineer who wants to jump in and write code without asking questions. Force it to ask questions (eg: “do not write code yet, please restate my requirements to make sure we are in alignment. Are there any extra bits of context or information that would help? I will tell you when to write code”)
If your model / chat app has the ability to always inject some kind of pre-prompt make sure to add something like “please do not jump to writing code. If this was a coding interview and you jumped to writing code without asking questions and clarifying requirements you’d fail”.
At the top of all your source files include a comment with the file name and path. If you have a project on one of these services, add an artifact that is the directory tree (`tree --gitignore` is my goto). This helps "unaided" chats get a sense of what documents they are looking at.
And also, it’s a professional bullshitter so don’t trust it with large scale code changes that rely on some language / library feature you don’t have personal experience with. It can send you down a path where the entire assumption that something was possible turns out to be false.
Does it seem like a lot of work? Yes. Am I actually more productive with the tool than without? Probably. But it sure as shit isn't "free" in terms of time spent providing context. I think the more I use these models, the more I get a sense of what it is good at and what is going to be a waste of time.
Long story short, prompting is everything. These things aren’t mind readers (and worse they forget everything in each new session)
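The file-header convention mentioned earlier can be automated. A minimal sketch (the `*.py` glob and `#` comment syntax are assumptions; adjust for your language):

```python
# Bundle source files for pasting into a chat, each prefixed with a
# comment naming its path, so the model knows what file it's looking at.
from pathlib import Path

def bundle_for_chat(root: str, pattern: str = "*.py") -> str:
    parts = []
    for path in sorted(Path(root).rglob(pattern)):
        parts.append(f"# File: {path}\n{path.read_text()}")
    return "\n\n".join(parts)
```

Combine the output with a `tree --gitignore` listing and you have a reusable context artifact instead of re-explaining the project every session.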
Super interesting that my experience mirrors exactly what you are writing... except for me finding Claude to be almost useless (often misunderstands me, gives answers that are plain wrong) and 4o to be a very helpful, if not somewhat dull, jack-of-all trades in helping me be a cruise control for the mind.
I could only ever really jam with 4o.
Makes me wonder if there are personal communication preferences at play here.
Probably. But statistically, working with 4o is a loss of time for me. LLMs are like an investment: you write the prompts, you "work" with them. If the LLM is too weak, this is a loss of time. You need to have a positive return on that investment. With ChatGPT 4o / o1, most of the time the investment has almost zero return for me. Before Claude Sonnet 3.5 I already had a ChatGPT PRO account, but never used it for coding, since it was useless most of the time except for throwaway scripts that I didn't want to write myself, or as a Stack Overflow replacement for trivial stuff. Now it's different.
Like what? Claude has become my go-to, but I find that it's wrong often enough that I really can't trust it for anything. If it says something, I have to go dig through its citations very carefully.
A very big surprise is just how much better Sonnet 3.5 is than Haiku. Even the confusingly-more-expensive-Haiku-variant Haiku 3.5 that's more recent than Sonnet 3.5 is still much worse.
I wonder whether LLMs are very useful, but at a much narrower set of tasks than we expect. Like fuzzy manipulation of logical specifications.
I.e., over time they constitute a fundamental shift in how we interact with abstractions in computers. The current fundamentals will still remain, but they will become increasingly malleable. Details in code will become less important. Architecture will become increasingly important. But at the same time, the cost of refactoring or changing architecture will quickly drop.
Any details that are easily lost when passing through an LLM will be details that have the highest maintenance cost. Any important details that can be retained by an LLM can move up and down the ladder of abstraction at will.
Can an LLM based solution maintain software architectures without introducing noise? The answer to that is the difference between somewhat useful and game changing.
Most people consider their own brain useless and don't use it, so it's not strange that they do the same with AI. How many people just refuse to learn how to parallel park, a new language, calculus or even basic arithmetic, "because they aren't good at it".
LLMs have given computers the ability to communicate with us in natural language, we didn't have that before at this level. In order to do this, they've been fed with a lot of coherent stuff and give the impression of being coherent, but we know they're just statistical machines. But at least they can now communicate naturally with us, so now we have that infrastructure available, as we do have TTS or ASR or monitors and keyboards available. It's still up to us to now make proper agents out of them. Agents for the software we've been using for decades. They can take over a lot of tedious work for us.
Why are you pasting huge chunks of potentially crown jewels code into a 3rd party service where prompts are going to most likely be turned into training/surveillance material?
>They are also great at doing boring tasks for which you can provide perfect guidance (but that still would take you hours)
All the tasks I can think of dealing with on my own computer that would take hours, a) are actually pretty interesting to me and b) would equally well take hours to "provide perfect guidance". The drudge work of programming that I notice comes in blocks of seconds at a time, and the mental context switch to using an LLM would be costlier.
I swear these goalposts keep getting moved, I remember being told that GPT3.5 is a useless toy but the paid GPT4 is lifechanging, and now that GPT4 is free I'm told that it's a useless toy but paid o1 or paid Sonnet are lifechanging. Looking forward to o1 and Sonnet becoming useless toys, unlike the lifechanging o3.
Why do people have such narrow views on what makes LLMs useful? I use them for basically everything.
My son throwing an irrational tantrum at the amusement park and I can't figure out why he's like that (he won't tell me or he doesn't know himself either) or what I should do? I feed Claude all the facts of what happened that day and ask for advice. Even if I don't agree with the advice, at the very least the analysis helps me understand/hypothesize what's going on with him. Sure beats having to wait until Monday to call up professionals. And in my experience, those professionals don't do a better job of giving me advice than Claude does.
It's the weekend, my wife is sick, the general practitioner is closed, the emergency weekend line has 35 people in the queue, and I want some quick, half-assed medical guidance that, while I know it might not be 100% reliable, is still better than nothing for the next 2 hours? Feed all the symptoms and facts to Claude/ChatGPT and it does an okay job a lot of the time.
I've been visiting a Traditional Chinese Medicine (TCM) practitioner for a week now and my symptoms are indeed reducing. But TCM paradigms and concepts are so different from western medicine's that I can't understand the doctor's explanation at all. Again, Claude does a reasonable job of explaining to me what's going on, or why it works, from a western medicine point of view.
Want to write a novel? Brainstorm ideas with GPT-4o.
I had a debate with a friend's child over the correct spelling of a Dutch word ("instabiel" vs "onstabiel"). Google results were not very clear. ChatGPT explained it clearly.
Just where is this "useless" idea coming from? Do people not have a life outside of coding?
Yes people have lives outside of coding, but most people are able to manage without having AI software intercede in as much of their lives as possible.
It seems like you trust AI more than people and prefer it to direct human interaction. That seems to be satisfying a need for you that most people don't have.
At the risk of sounding impolite or critical of your personal choices: this, right here, is the problem!
You don’t understand how medicine works, at any level.
Yet you turn to a machine for advice, and take it at face value.
I say these things confidently, because I understand medicine well enough not to seek my own answers. Recently I went to a doctor for a serious condition, and every notion I had was wrong. Provably wrong!
I see the same behaviour in junior developers that simply copy-paste in whatever they see in StackOverflow or whatever they got out of ChatGPT with a terrible prompt, no context, and no understanding on their part of the suitability of the answer.
This is why I and many others still consider AIs mostly useless. The human in the loop is still the critical element. Replace the human with someone that thinks that powdered rhino horn will give them erections, and the utility of the AI drops to near zero. Worse, it can multiply bad tendencies and bad ideas.
I’m sure someone somewhere is asking DeepSeek how best to get endangered animals parts on the black market.
I believe it's more frustration directed at the mismatch between marketing and reality, combined with the general well deserved growing hatred for SV culture, and, more broadly, software engineers. The sentiment would be completely different if the entire industry marketed themselves like the helpful tools they are rather than the second coming of Christ they aren't. This distinction is hard to make on "fast food" forums like this one.
If you aren't a coder, it's hard to find much utility in "Google, but it burns a tree whenever you make an API call, and everything it tells you might be wrong". I for one have never used it for anything else. It just hasn't ever come up.
It's great at cheating on homework, kids love GPTs. It's great at cheating in general, in interviews for instance. Or at ruining Christmas, after this year's LLM debacle it's unclear if we'll have another edition of Advent of Code. None of this is the technology's fault, of course, you could say the same about the Internet, phones or what have you, but it's hardly a point in favor either.
And if you are a coder, models like Claude actually do help you, but you have to monitor their output and thoroughly test whatever comes out of them, a far cry from the promises of complete automation and insane productivity gains.
If you are only a consumer of this technology, like the vast majority of us here, there isn't that much of an upside in being an early adopter. I'll sit and wait, slowly integrating new technology in my workflow if and when it makes sense to do so.
> there isn't that much of an upside in being an early adopter.
Other than, y'know, using the new tools. As a programmer-heavy forum, we focus a lot on LLMs' (lack of) correctness. There's more than a little bit of annoyance when things are wrong, like being asked to grab the red blanket and then getting into an argument over it being orange, instead of focusing on what was important: someone needed the blanket because they were cold.
Most of the non-tech people who use ChatGPT that I've talked to absolutely love it, because they don't feel it judges them for asking stupid questions, and they have conversations with it about absolutely everything in their lives, down to which outfit to wear to the party. There are wrong answers to that question as well, but they're far more subjective, and just having another opinion in the room is invaluable. It's just a computer and won't get hurt if you totally ignore its recommendations, and even better, it won't gloat (unless you ask it to) if you tell it later that it was right and you were wrong.
Some people have found upsides for themselves in their lives, even at this nascent stage. No one's forcing you to use one, but your job isn't going to be taken by AI, it's going to be taken by someone else who can outperform you that's using AI.
> They work great to explore what is at the borders of your knowledge.
But not at exploring what is at the border of knowledge itself. And by converging on the conventional, LLMs actually lead you away from anything that would actually extend it.
> doing boring tasks for which you can provide perfect guidance
That's true, but you never need an LLM for that. There are wonderful scripts written by wonderful people and provided for free to those who search in the right places. LLM companies profit from these without providing anything in return.
They are worse than people who grab FOSS and turn it into overpriced and aggressively marketed business models and services, or people who threaten and sue FOSS projects for being better, free alternatives to their bloated and often "illegally telemetric" services.
> able to accelerate you
True, but you leave too much for data brokers and companies like Meta to abuse and exploit in the future. All that additional "interactional data" will do so much worse to humanity than all those previous data sets did in elections, for example, or pretty much all consumer markets. They will mostly accelerate all these dimwitted Fortune 5000 companies that have sabotaged consumers into way too much dumb shit - way more than is reasonable or "ok". And educated, wealthy and or tech-savvy people won't be able to avoid/evade any of that. Especially when it's paired with meds, drugs, foods, biases, fallacies, priming and so on and all the knowledge we will gain on bio-chemical pathways and human liability to sabotage.
They are great for coders, of course, everyone can be an army of clone-warriors with auto-complete on steroids now and nobody can tell you what to do with all that time that you now have and all that money, which, thanks to all of us but mostly our ancestors, is the default. The problem is the resulting hyper-amplified, augmented financial imbalance. It's gonna fuck our species if all the technical people don't restore some of that balance, and everybody knows what that means and what must be done.
Is there a way to use this in Jetbrains IDEs? (I've not been impressed with their AI Assistant.) There are a few plugins, but from the reviews they all seem kind of mediocre.
I personally use the Zed editor AI assistant integration with Sonnet for anything AI-related, while using a JetBrains IDE for coding / code reading, side-by-side.
I haven’t found anything comparably good for JetBrains IDEs yet, but I’m also not switching to something else as my main editor.
The GitHub Copilot plugin is decent. It's not going to write a whole app for you, but it accelerates repetitive stuff, can give suggestions you didn't think of, or save you a trip to the documentation.
I use IntelliJ as my main coding tool but also use VSCode and Sublime Text. If you have access to local LLMs, or have an API key for some, the Continue plugin (basically Cursor, but usable in IntelliJ) is the best of the best for IntelliJ (IMO). I have a box running some local models, including Phind and StarCoder (plus some small embeddings), and have been super happy with the end product.

Next up, Google Gemini Code Assist has been the best of the non-configured IntelliJ AI tools I have tried. There are better ones out there, but IMO not for IntelliJ. It's still free for a few more weeks and I have been using it since the free release; fun to use. You can pre-prompt: say you are an expert XXX, please be funny, fill in the rest of your regular prompts.

The Co-Pilot I use for work is very limited and will only answer coding questions. I tried to tell it that it was my coding buddy and its name was Phil, and it told me it cannot have a personality or be funny. I believe the paid personal Co-Pilot allows you to choose which LLM it uses (I cannot confirm).

The Phind VSCode plugin works really well. Also, the Phind coding models are on par with some of the other big ones, and free if you have a subscription (or run locally). Sublime is around to open those GB+ files, as VSCode chokes and it's not worth the RAM of opening another IntelliJ.
Each task / programming language / query requires trying different LLM models and novel ways of prompting. If it's not work-related (or work pays for the tool you use), sending as much of the code as is relevant also helps the answers be more useful.
Most of the people I meet who say LLMs are not useful have only tried one (flavor / plugin), do not know how to pre-prompt or prompt, and do not give the tools a chance. They try one or two things, say "yep, it's not good," and give up.
It's still hard for me to admit that Prompt Engineering is a profession, but it's the same as Google-fu. Once you learn it you can become an LLM Ninja!
I do not believe LLMs are coming for my job (just yet), but I do believe they are going to be able to replace some people, that they are useful, and that those who do not use them will be at a disadvantage.
Exactly, and right now the LLMs' acceleration effect is a tool, not a "give me the final solution" button. Even people who can't code, using LLMs to build applications from scratch, still have this tool mindset. This is why they can use them effectively: they don't stop at the first failed solution; they provide hints to the LLM, test the code, try to figure out what the problem is (also with the LLM's help), and so forth. It's a matter of mindset.
I’m surprised you only have one use case. I use LLMs to research travel, adjust recipes, check biographies and book reviews, and many many more things.
...Begun November 4, 2024, published December 28, 2024.
...assisted by Claude 3.5 sonnet, trained on my previous books...
...puzzles co-created by the author and Claude
...GPT-4o and -o1 were useful in latex configurations...doing proof-reading.
...Gemini Experimental 1206 was an especially good proof-reader
...Exercises were generated with the help of Claude and may have errors.
...project was impossible without the creative labors of Claude
The obvious comparison is to the classic Strang https://math.mit.edu/~gs/everyone/ which took several *years* to conceptualize, write, peer review, revise and publish.
Ok maybe Strang isn't your cup of tea, :%s/Strang/Halmos/g , :%s/Strang/Lipschutz/g, :%s/Strang/Hefferon/g, :%s/Strang/Larson/g ...
Working through the exercises in this new LLMbook, I'm thinking...maybe this isn't going to stand the test of time. Maybe acceleration is not so hot after all.
"The story of linear algebra begins with systems of equations, each line describing a constraint or boundary traced upon abstract space. These simplest mathematical models of limitation — each equation binding variables in measured proportion — conjoin to shape the realm of possible solutions. When several such constraints act in concert, their collaboration yields three possible fates: no solution survives their collective force; exactly one point satisfies all bounds; or infinite possibilities trace curves and planes through the space of satisfaction. This trichotomy — of emptiness, uniqueness, and infinity — echoes through all of linear algebra, appearing in increasingly sophisticated forms as our understanding deepens."
Maybe I'm not the target audience, but... that really doesn't make me interested in continuing to read.
^ This perfectly encapsulates the story I see every time someone digs into the details of any llm generated or assisted content that has any level of complexity.
Great on the surface, but lacking any depth, cohesion, or substance.
I started a book about CIAM (customer identity and access management) using Claude to help outline a chapter. I'd edit and refine the outline to make sure it covered everything.
Then I'd have Claude create text. I'd then edit/refine each chapter's text.
Wow, was it unpleasant. It was kinda cool to see all the words put together, but editing the output was a slog.
It's bad enough editing your own writing, but for some reason this was even worse.
Just to clarify: I have nothing to do with this book. I was just forwarded a copy and I thought it's relevant to the topic at hand.
from the wild swings in karma, looks like people are annoyed with the message and shooting down the messenger.
We're at the "computers play chess badly" stage. Then we'll hit the Deep Thought (1988) and Deep Blue (1995-1997) stages, but still saying that solving Go won't happen for 50+ years and that humans will continue to be better than computers.
The date/time that divides my world into before/after is AlphaGo v Lee Sedol game 3 (2016). From that time forward, I don't dismiss out of hand speculations about how soon we can have intelligent machines. Ray Kurzweil's date of 2045 is as good as any (and better than most) for an estimate. Like Moore's (and related) Laws, it's not about how, but about the historical pace of advancement crossing a fairly static point of human capability.
Application coding requires much less intelligence than playing Go at these high levels. The main differences are Go's concise representation and clear final-outcome scoring, whereas LLMs must deal with the fuzziness of human communication, which they do quite well. There may be a few more pegs to place, but when that happens seems predictably unknown.
> There’s a flipside to this too: a lot of better informed people have sworn off LLMs entirely because they can’t see how anyone could benefit from a tool with so many flaws. The key skill in getting the most out of LLMs is learning to work with tech that is both inherently unreliable and incredibly powerful at the same time. This is a decidedly non-obvious skill to acquire!
I wish the author qualified this more. How does one develop that skill?
What makes LLMs so powerful on a day to day basis without a large RAG system around it?
Personally, I try LLMs every now and then, but haven’t seen any indication of their usefulness for my day to day outside of being a smarter auto complete.
When I started my career in 2010, google was a semi-serious skill. All of the little things that we know how to do now such as ignoring certain sites, lingering on others, and iteratively refining our search queries were not universally known at the time. Experienced engineers often relied on encyclopedic knowledge of their environment or by "reading the manual".
In my experience, LLM tools are the same, you ask for something basic initially and then iteratively refine the query either via dialog or a new prompt until you get what you are looking for or hit the end of the LLM's capability. Knowing when you've reached the latter is critically important.
One difference is that skillful googling still only involved typing a few keywords or a short phrase and some syntax, and then knowing how to skim the results and iterate, and how to operate your browser efficiently. With LLMs, you have to type a lot more (and/or use voice input), and often also read more, it’s also not stateless/repeatable like following a web link, and most output looks the same (as opposed to the variations in web sites). I pride(d) myself on my Google foo, it was fun, but I find using LLMs to be quite exhausting in comparison.
* Most existing LLM interfaces are very bad at editing history, instead focusing entirely on appending to history. You can sort of ignore this for one-shot, and this can be properly fixed with additional custom tools, but ...
* By the time you refine your input enough to patch over all the errors in the LLM's output for your sensible input, you're bigger than the LLM can actually handle (much smaller than the alleged context window), so it starts randomly ignoring significant chunks of what you wrote (unlike context-window problems, the ignored parts can be anywhere in the input).
One of the things I find most frustrating about LLMs is how resistant they are to teaching other people how to use them!
I'd love to figure this out. I've written more about them than most people at this point, and my goal has always been to help people learn what they can and cannot do - but distilling that down to a concise set of lessons continues to defeat me.
The only way to really get to grips with them is to use them, a lot. You need to try things that fail, and other things that work, and build up an intuition about their strengths and weaknesses.
The problem with intuition is it's really hard to download that into someone else's head.
My first stab at trying ChatGPT last year was asking it to write some Rust code to do audio processing. It was not a happy experience. I stepped back and didn't play with LLMs at all for a while after that. Reading your posts has helped me keep tabs on the state of the art and decide to jump back in (though with different/easier problems this time).
It's really important to go and read the code that the author of this article actually produces with LLMs. He posted on hacker news a few months ago, a post called something like "everything I've made with ChatGPT in the month of September" or something. He's producing little toy applications that don't even begin to resemble real production code. He thinks these "tools" are useful because they help him write pointless slop.
The point of that post isn't "look at these incredible projects I've built (proceeds to show simple projects)."
It's "I built 14 small and useful tools in a single week, each taking between 2 and 10 minutes".
The thing that's interesting here is that I can have an LLM kick out a working prototype of a small, useful tool in only a little more time than it takes to run a Google search.
That post isn't meant to be about writing "real production code". I don't know why people are confused over that.
I think most tech folks struggle with it because they treat LLMs as computer programs, and their experience is that SW should be extremely reliable - imagine using a calculator that was wrong 5% of the time - no one would accept that!
Instead, think of an LLM as the equivalent of giving a human a menial task. You know that they're not 100% reliable, and so you give them only tasks that you can quickly verify and correct.
Abstract that out a bit further, and realize that most managers don't expect their reports to be 100% reliable.
Don't use LLMs where accuracy is paramount. Use it to automate away tedious stuff. Examples for me:
Cleaning up speech recognition. I use a traditional voice recognition tool to transcribe, and then have GPT clean it up. I've tried voice recognition tools for dictation on and off for over a decade, and always gave up because even a 95% accuracy is a pain to clean up. But now, I route the output to GPT automatically. It still has issues, but I now often go paragraphs before I have to correct anything. For personal notes, I mostly don't even bother checking its accuracy - I do it only when dictating things others will look at.
And then add embellishments to that. I was dictating out a recipe I needed to send to someone. I told GPT up front to write any number that appears next to an ingredient as a numeral (i.e. 3 instead of "three"). Did a great job - didn't need to correct anything.
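Not the commenter's actual setup, but the routing step is easy to sketch: build a cleanup prompt (with per-task rules like the numerals one) and hand it to whatever chat API you use. Everything here (function name, rule wording) is my own invention.

```python
def build_cleanup_prompt(transcript, extra_rules=None):
    """Build chat messages asking a model to clean up raw dictation output."""
    rules = [
        "Fix transcription errors, punctuation, and capitalization.",
        "Do not change the meaning or add content.",
    ] + list(extra_rules or [])
    system = "You clean up speech-to-text output.\n" + "\n".join(f"- {r}" for r in rules)
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": transcript},
    ]

# The recipe case above: ask for numerals next to ingredients.
messages = build_cleanup_prompt(
    "add three cups of flour and two eggs",
    extra_rules=["Write any number next to an ingredient as a numeral (3, not 'three')."],
)
```

The messages list is what you would pass to the chat endpoint of your provider of choice.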
And then there are always the "I could do this myself but I didn't have time so I gave it to GPT" category. I was giving a presentation that involved graphs (nodes, edges, etc). I was on a tight deadline and didn't want to figure out how to draw graphs. So I made a tabular representation of my graph, gave it to GPT, and asked it to write graphviz code to make that graph. It did it perfectly (correct nodes and edges, too!)
Sure, if I had time, I'd go learn graphviz myself. But I wouldn't have. The chances I'll need graphviz again in the next few years is virtually 0.
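For what it's worth, the tabular-graph-to-Graphviz translation is mechanical enough to sketch directly; this assumes a hypothetical edge-list format, not the commenter's actual table:

```python
def edges_to_dot(edges, name="G"):
    """Render a list of (src, dst) pairs as Graphviz DOT source."""
    lines = [f"digraph {name} {{"]
    for src, dst in edges:
        lines.append(f'    "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)

dot = edges_to_dot([("load", "parse"), ("parse", "render")])
print(dot)
```

Feed the resulting string to `dot -Tpng` and you have your presentation graph.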
I've actually used LLMs to do quick reformatting of data a few times. You just have to be careful that you can verify the output quickly. If it's a long table, then don't use LLMs for this.
Another example: I have a custom note taking tool. It's just for me. For convenience, I also made an HTML export. Wouldn't it be great if it automatically made alt text for each image I have in my notes? I would just need to send it to the LLM and get the text. It's fractions of a cent per image! The current services are a lot more accurate at image recognition than I need them to be for this purpose!
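A sketch of the scanning half of that idea, stdlib only; the actual captioning call to a vision model is where the LLM comes in and is omitted here:

```python
from html.parser import HTMLParser

class MissingAltCollector(HTMLParser):
    """Find <img> tags in an HTML export that have no alt text."""
    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag == "img" and not dict(attrs).get("alt"):
            self.missing.append(dict(attrs).get("src"))

def images_needing_alt(html):
    parser = MissingAltCollector()
    parser.feed(html)
    # Each src here would then be sent to a vision model for a caption.
    return parser.missing

found = images_needing_alt('<p><img src="a.png"><img src="b.png" alt="chart"></p>')
print(found)  # → ['a.png']
```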
Oh, and then of course, having it write Bash scripts and CSS for me :-) (not a frontend developer - I've learned CSS in the past, but it's quicker to verify whatever it throws at me than Google it).
Any time you have a task and lament "Oh, this is likely easy, but I just don't have the time" consider how you could make an LLM do it.
A similar use case for me - I wrote some technical documentation for our wiki about a somewhat complicated relationship between ids in some database tables. I copied my text explanation into an LLM and asked it to make a diagram and it did so. Took very little time from me and it was fast/easy to verify that the quality was good.
I think there’s the added reason that a lot of folks went into tech because (consciously or unconsciously) they prefer dealing with predictable machines than with unreliable humans. And now that career choice begins to look like a bait and switch. ;)
> Instead, think of an LLM as the equivalent of giving a human a menial task. You know that they're not 100% reliable, and so you give them only tasks that you can quickly verify and correct.
The problem is: for the tasks that I can give the LLM (or human) that I can easily verify and correct, the LLM fails with the majority of them, for example
- programming tasks of my area of expertise (which is more "mathematical" than what is common in SV startups), where I know what a high-level solution has to look like, and where I can ask the LLM to explain the gory details to me. Yes, these gory details are subtle (which is why the task can be menial), but the code has to be right. I can verify this, and the code is not correct.
- getting literature references about more obscure scientific (in particular mathematical) topics. I can easily check whether these literature references (or summaries of these references) are hallucinations - they typically are.
My experience is that for certain tasks LLMs are great, and for others they are basically useless.
The best prompts though are always written in a separate text file for me and pasted in. Follow up questions are never as good as a detailed initial prompt.
Formulating questions well enough to solve the problem at hand is a skill, but beyond that I don't think there is anything special about how to ask LLMs a question.
In areas the LLM is rather useless, no amount of variation in prompting can solve that problem IMO. Just like if the tasks is something the LLM is good at, the prompt can be pretty sloppy and seem like magic with how it can understand what you want.
There's a similar dynamic in building reliable distributed systems on top of an unreliable network. The parts are prone to failure but the system can keep on working.
The tricky problem with LLMs is identifying failures - if you're asking the question, it's implied that you don't have enough context to assess whether it's a hallucination or a good recommendation! One approach is to build ensembles of agents that can check each other's work, but that's a resource-intensive solution.
It's amazing this is still an opinion in 2025. I now ask devs how they use AI as part of their workflows when I interview. It's a standard skill I expect my guys to have.
I concur that asking devs how they use AI is a great idea.
Recently, I shared a code base with a junior dev and she was surprised with the speed and sophistication of the code. The LLM did 80+% of the "coding".
What was telling was as she was grokking the code (for helping the ~20%), she was surprised at the quality of the code - her use of the LLM did not yield code of similar quality.
I find that the more domain awareness one brings to the table, the better the output is. Basically the clearer one's vision of the end-state, the better the output.
One other positive side-effect of using "LLMs as a junior-dev" for me has been that my ambitions are greater. I want it all - better code, more sophisticated capabilities even for relatively not-important projects, documentation, tests, debug-ability. And once the basic structure is in place, many a time it is trivial to get the rest.
It's never 100%, but even with 80+%, I am faster than ever before, deliver better quality code, and can switch domains multiple times a week and never feel drained.
Sharing best AI hacks within a team will have the same effect as code-reviews do in ensuring consistency. Perhaps an "LLM chat review", especially when something particularly novel was accomplished!
I would characterize good prompting as: write out your whole problem you're trying to solve, then think to yourself what the clarifying questions would be if you were a junior trying to solve it. Better yet - ask the LLM to ask you challenging clarifying questions for several rounds. Then, take all that information and re-compile it back into a list of all the important components of the project, and re-read it to make sure there's no particular ambiguous part or weird part that would be over-emphasized by the language you used. Then, emphasize the core concerns again, and tell it how you'd like it to output the response (keeping in mind that it will always do best with a conversation-style format with loose restrictions). Never let a conversation stray too long from the original goals lest it start forgetting.
Once that's all done, you basically have a well-structured question you could pass to an underling and have them completely independently work on the project without bugging you. That's the goal. Now, pass that to o1 or Claude, depending on whether it's a general-purpose task (o1) or a code-specific task (Claude), and wait for response. From there, have a conversation or test-and-followup of whatever it spits out, this time with you asking questions. If good enough, done. If not, wrap up whatever useful insights from that line of questioning and put it back into the initial prompt and either re-post it at the end of the conversation or start a fresh conversation.
I find 90% of the time this gets exactly what I'm after eventually. The few other cases are usually because we hit some cycle where the AI doesn't fully know what to change/respond, and it keeps repeating itself when I ask. The trick then is to ask things a different way or emphasize something new. This is usually just a code-specific issue, for general problems it's much better. One other trick is to ask it to take a step back and just tackle the problem in a theoretical/philosophical way first before trying to do any coding or practical solving, and then do that in a second phase (asking o1 to architect code structure and then Claude to implement it is a great combo too). Also if there is any way to break up the problem into smaller pieces which can be tackled one conversation at a time - much better. Just remember to include all relevant context it needs to interface with the overall problem too.
That sounds like a lot, but it's essentially just project management and delegation to somewhat-flawed underlings. The upside is instead of waiting a workweek for them to get back to you, you just have to wait 20 seconds. But it does mean a ton of reading and writing. There are certainly already some meta-prompts where you can get the AI to essentially do this whole process for you and assess itself, but like all automation that means extra ways for things to break too. Let the AI devs cook though and those will be a lot more commonplace soon enough...
Great summary of highlights. Don't agree with all, but I think it's a very sound attempt at a year in review summary
>LLM prices crashed
This one has me a little spooked. The white knight on this front (DS) has both announced increases and has had staff poached. There is still Gemini free tier which is ofc basically impossible to beat (solid & functionally unlimited/free) but it's google so reluctant to trust.
Seriously worried about seeing a regression on pricing in first half of 2025. Especially with the OAI $200 price anchoring.
>“Agents” still haven’t really happened yet
Think that's largely because it's a poorly defined concept and true "agent" implies some sort of pseudo-agi autonomy. This is a definition/expectation issue rather than technical in my mind
>LLMs somehow got even harder to use
I don't think that's 100% right. An explosion of options is not the same as harder to use, and the guidance for noobs is still pretty much the same as always (llama.cpp or one of the common frontends like text-generation-webui). It's become harder to tell what is good, but not harder to get going.
----
One key theme I think is missing is just how hard it has become to tell what is "good" for the average user. There is so much benchmark shenanigans going on that it's just impossible to tell. I'm literally at the "I'm just going to build my own testing framework" stage. Not because I can do better technically (I can't)...but because I can gear it towards things I care about and I can be confident my DIY sample hasn't been gamed.
The biggest reason I'm not worried about prices going back up again is Llama. The Llama 3 models are really good, and because they are open weight there are a growing number of API providers competing to provide access to them.
These companies are incentivized to figure out fast and efficient hosting for the models. They don't need to train any models themselves, their value is added entirely in continuing to drive the price of inference down.
Groq and Cerebras are particularly interesting here because WOW they serve Llama fast.
That's an indication that most business-sized models won't need some giant data center. This is going to be a cheap technology most of the time.
OpenAI is thus way overvalued.
Most of the laptops that can run these models today are specced like high-end dedicated bare-metal servers. Most shared VM servers are way below these laptops. Most people buying a new laptop today won't be able to run them, and most devs getting a website up with a server won't be able to run them either.
This means that the definitions of "laptop" and "server" are dependent on use. We should instead talk about RAM, GPU and CPU speed which is more useful and informative but less engaging than "my laptop".
I don't think openai's valuation comes from a data center bet -- rather, I'd suppose, investors think it has a first-mover advantage on model quality that it can (maybe?) attract some buy-out interest or otherwise use in yet-to-be-specified product lines.
However, it has been clear for a long time that meta are just demolishing any competitor's moats, driving the whole megacorp AI competition to razor thin margins.
It's a very welcome strategy from a consumer pov, but -- it has to be said -- genius from a business pov. By deciding that no one will win, it can prevent anyone leapfrogging them at a relatively cheap price.
The last OpenAI valuation I read about was 157 billion. I am struggling to understand what justifies this. To me, it feels like OpenAI is at best a few months ahead of competitors in some areas. But even if I am underestimating the advantage and it's a few years instead of a few months, why does it matter? It's not like AI companies are going to enjoy the first-mover advantage the internet giants had over their competition.
It's justified if AGI is possible. If AGI is possible, then the entire human economy stops making sense as far as money goes, and 'owning' part of OpenAI gives you power.
That is of course, assuming AGI is possible and exponential, and that marketshare goes to a single entity instead of a set of entities. Lots of big assumptions. Seems like we're heading towards a slow-lackluster singularity though.
People are buying shares at $x because they believe they will be able to sell them for more later. I don't think there's a whole lot more to it than that.
Us skeptics believe that valuation prices in some form of regulatory capture or other non-market factor.
The non-skeptical interpretation is that it's a threshold function, a flat-out race with an unambiguous finish line. If someone actually hit self-improving AGI first there's an argument that no one would ever catch up.
Been in the Mac ecosystem since 2008, love it, but there is, and always has been, a tendency to talk about inevitabilities from scaling bespoke, extremely expensive configurations, and with LLMs, there's heavy eliding of what the user experience is, beyond noting response generation speed in tokens/s.
They run on a laptop, yes - you might squeeze up to 10 token/sec out of a kinda sorta GPT-4 if you paid $5K plus for an Apple laptop in the last 18 months.
And that's after you spent 2 minutes watching 1000 token* prompt prefill at 10 tokens/sec.
Usually it'd be obvious this'd trickle down, things always do, right?
But...Apple infamously has been stuck on 8GB of RAM in even $1500 base models for years. I have 0 idea why, but my intuition is RAM was ~doubling capacity at same cost every 3 years till early 2010s, then it mostly stalled out post 2015.
And regardless of any of the above, this absolutely melts your battery. Like, your 16 hr battery life becomes 40 minutes, no exaggeration.
I don't know why prefill (loading in your prompt) is so slow for local LLMs, but it is. I assume if you have a bunch of servers there's some caching you can do that works across all prompts.
I expect the local LLM community to be roughly the same size it is today 5 years from now.
* ~3 pages / ~750 words; what I expect is a conservative average for prompt size when coding
This seems like a non-sequitur unless you’re assuming something about the amount that people use models.
Most web servers can run some number of QPS on a developer laptop, but AWS is a big business, because there are a heck of a lot of QPS across all the servers.
Unless the best models themselves are costly/hard to produce, and there is not a company providing them to people free of charge AND for commercial use.
Simon has mentioned in multiple articles how cool it is to use 64GB DRAM for GPU tasks on his MacBook. I agree it's cool, but I don't understand why it is remarkable. Is Apple doing something special with DRAM that other hardware manufacturers haven't figured out? Assuming data centers are hoovering up nearly all the world's RAM manufacturing capacity, how is Apple still managing to ship machines with DRAM that performs close enough for Simon's needs to VRAM? Is this just a temporary blip, and PC manufacturers in 2025 will be catching up and shipping mini PCs that have 64GB RAM ceilings with similar memory performance? What gives?
llama.cpp can run LLMs on CPU, and an iGPU can also use system memory, so that's not the novel part. The novel part is that LLM inference is mostly memory-bandwidth bound. A custom-built PC with really fast DDR5 RAM gets around 100 GB/s; nVidia consumer GPUs reach around 1 TB/s at the top end, with mid-range GPUs at around half that. The M1 Max has 400 GB/s and the M1 Ultra 800 GB/s, and you can get Apple Silicon Macs with up to 192 GB of 800 GB/s memory usable by the GPU. That means much faster inference than CPU plus system memory, and it's more affordable than building a multi-GPU system to match the memory amount.
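The bandwidth-bound point implies a simple back-of-envelope ceiling: each generated token has to stream essentially all the weights of a dense model from memory, so tokens/sec can't exceed bandwidth divided by model size. The figures below are illustrative, not benchmarks.

```python
def decode_ceiling_tok_s(bandwidth_gb_s, model_gb):
    """Rough upper bound on decode speed for a dense model:
    every token reads ~all weights, so speed <= bandwidth / size."""
    return bandwidth_gb_s / model_gb

# A ~40 GB 4-bit quantized 70B model:
print(decode_ceiling_tok_s(400, 40))  # M1 Max (400 GB/s) → 10.0 tok/s at best
print(decode_ceiling_tok_s(100, 40))  # fast DDR5 PC (100 GB/s) → 2.5 tok/s at best
```

Real throughput lands below this ceiling once compute and KV-cache reads are accounted for.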
Apple uses LPDDR mounted on the same package as the CPU (unified memory) rather than socketed DRAM. It has a lot more memory bandwidth than typical PC DRAM, but still less than many GPUs. (Although the highest-end Macs have bandwidth that is in the same ballpark as GPUs.)
Apple designs its own chips, so the RAM sits on the same package as the CPU and they can talk at very high speeds. This is not the case for most PCs, where RAM is connected externally via socketed modules.
> I find the term “agents” extremely frustrating. It lacks a single, clear and widely understood meaning... but the people who use the term never seem to acknowledge that.
This 100%. “Agentic” especially as a buzzword can piss off
That's one of the more common definitions people use - especially people who aren't directly building agents, since the builders tend to get more hung up on "LLM with access to tools" or similar.
My problem is when people use that definition (or any other) without clarifying, because they assume it's THE obvious definition.
Nice overview. The challenge ahead for “AI” companies is that it appears there’s really no technical moat here. Someone comes out with something amazing and new and within months (if not weeks or days) it’s quickly copied. That environment where everything quickly becomes a commodity is a recipe for many/most companies in this space to quickly get washed out as it becomes economically unviable to play in such an environment.
The money is still flowing, for now, to subsidize that fiasco but as soon as that starts to slow, even just a bit, things are gonna get bumpy real quick. Super excited about this tech but there are dark storm clouds building on the horizon and absent a major “moat” breakthrough it’s gonna get rough soon.
Not necessarily. The playbook of what tends to happen is first a bunch of players go bust in the race to the bottom, then the survivors are free to raise prices a bit when others realize there’s not much point in entering a race to the bottom. Those left then let quality slip as competition cools.
That's exactly what happened with rideshare companies. It was an amazing new thing but subsidized in an unsustainable way; then a bunch of companies exited the space when it became a commoditized race to the bottom, and those left let quality slip. Now when you order an Uber, a car shows up that smells bad and has wheels about to fall off. The consumer experience was a lot better when Uber was a VC-subsidized bonanza.
Tragically, admitting ignorance, even with the desire to learn, often has negative social repercussions.
(In my experience as an app developer, getting any traction and/or money from your app can be much more difficult than actually building it.)
My experience is that people who claim they build worthwhile software "exclusively" using LLMs are lying. I don't know you and I don't know if you are lying, but I would be willing to bet my paycheck you are.
That's great, but professional programmers are afraid of the future maintenance burden.
Not just the development of the code, but the entire thing: the code, infra, auth, CC payments, etc.
What's the app?!!
Would you mind sharing which app you released?
There are certain classes of problems that LLMs are good at. Accurately regurgitating all accumulated world knowledge ever is not one, so don’t ask a language model to diagnose your medical condition or choose a political candidate.
But do ask them to perform suitable tasks for a language model! Every day by automation I feed in the hourly weather forecast my home ollama server and it builds me a nice readable concise weather report. It’s super cool!
There are lots of cases like this where you can give an LLM reliable data and ask it to do a language related task and it will do an excellent job of it.
If nothing else it’s an extremely useful computer-human interface.
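A sketch of the prompt-building half of that automation. The data shape is hypothetical; the completed prompt would go to a local model, e.g. via Ollama's HTTP API or `ollama run`.

```python
def weather_prompt(hourly):
    """Turn (time, temp_c, conditions) rows into a prompt for a local model."""
    lines = [f"{t}: {temp:.0f}°C, {cond}" for t, temp, cond in hourly]
    return ("Write a concise, readable weather report "
            "from this hourly forecast:\n" + "\n".join(lines))

prompt = weather_prompt([("06:00", 4.2, "fog"), ("12:00", 11.0, "sunny")])
print(prompt)
```

The point of the pattern: the model never has to know the weather, only to summarize reliable data you hand it.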
Not to dissuade you from a thing you find useful, but are you aware that the National Weather Service produces an Area Forecast Discussion product, from each local NWS office daily or more often, that accomplishes this with human meteorologists and a clickable jargon glossary?
https://forecast.weather.gov/product.php?site=SEW&issuedby=S...
(o1-preview) LLMs show promise in clinical reasoning but fall short in probabilistic tasks, underscoring why AI shouldn't replace doctors for diagnosis just yet.
"Superhuman performance of a large language model on the reasoning tasks of a physician" https://arxiv.org/abs/2412.10849 [14 Dec 2024]
I actually found 4o+search to be really good at this... Admittedly what I did was more "research these candidates, tell me anything newsworthy, pros/cons, etc" (much longer prompt) and well, it was way faster/patient at finding sources than I ever would've been, telling me things I never would've figured out with <5 minutes of googling each set of candidates (which is what I've done before).
Honestly my big rule for what LLMs are good at is stuff like "hard/tedious/annoying to do, easy to verify" and maybe a little more than that. (I think after using a model for a while you can get a "feel" for when it's likely BSing.)
Honestly they are very decent at it if you give them accurate information in which to make the diagnosis. The typical problem people have is being unable to feed accurate information to the model. They'll cut out parts they don't want to think about or not put full test results in for consideration.
You feed it a weather report and it responds with a weather report? How is that useful?
No, I think if we follow the money, we will find the problem.
People keep saying this, and there are use cases for which this is definitely the case, but I find the opposite to be just as true in some circumstances.
I'm surprised at how good LLMs are at answering "me be monkey, me have big problem with code" questions. For simple one-offs like "how to do x in Pandas" (a frequent one for me), I often just give Claude a mish-mash of keywords, and it usually figures out what I want.
An example prompt of mine from yesterday, which Claude successfully answered, was "python sha256 of file contents base64 safe for fs path."
With a system prompt to make Claude's output super brief and a command to execute queries from the terminal via Simon Willison's LLM tool, this is extremely useful.
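For what it's worth, the answer to that keyword mish-mash is small enough to sketch here (the function name is mine; Claude's actual reply may differ in detail):

```python
import base64
import hashlib

def file_digest_for_path(path: str) -> str:
    """Return a filesystem-safe base64 string of the file's SHA-256 digest."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).digest()
    # urlsafe_b64encode avoids "/" and "+"; strip "=" padding for cleaner names
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
```

The urlsafe alphabet is the key detail the keywords "safe for fs path" are pointing at.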
Good communication with LLMs means using the fewest keywords needed to make exactly what you want deducible to the model.
and
> a key thing with LLMs is that their ability to help, as a tool, changes vastly based on your communication ability.
I still hold that the innovations we've seen as an industry with text will transfer to data from other domains. And there's an odd misbehavior I've now seen play out twice -- back in 2017 with vision models (please don't shove a picture of a spectrogram into an object detector), and today. People are trying to coerce text models to do stuff with data series, or (again!) pictures of charts, rather than paying attention to timeseries foundation models which can work directly on the data.[1]
Further, the tricks we're seeing with encoder / decoder pipelines should work for other domains, and we're not yet recognizing that as an industry. Whisper and the emerging video models are getting there, but think about multi-spectral satellite data, or fraud detection (a type of graph problem).
There's lots of value to unlock from coding models. They're just text models. So what if you were to shove in an abstract syntax tree as the data representation, or the intermediate code from LLVM or a JVM or whatever runtime, and interact with that?
[1] https://huggingface.co/ibm-granite/granite-timeseries-ttm-r1 - shout-out to some former colleagues!
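As a minimal sketch of the AST idea in Python, using the standard ast module (the representation choice here is my own illustration, not an established pipeline):

```python
import ast

def source_to_token_stream(source: str) -> str:
    """Linearize Python source into an AST dump -- one candidate
    'token stream' representation to hand a model instead of raw text."""
    return ast.dump(ast.parse(source))

# The dump is deterministic text, so it tokenizes like any other string.
print(source_to_token_stream("def add(a, b):\n    return a + b\n"))
```

Whether a model learns better from this than from raw source is exactly the open question the comment raises.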
> It's a bit sad and confusing that LLMs ("Large Language Models") have little to do with language; It's just historical. They are highly general purpose technology for statistical modeling of token streams. A better name would be Autoregressive Transformers or something.
> They don't care if the tokens happen to represent little text chunks. It could just as well be little image patches, audio chunks, action choices, molecules, or whatever. If you can reduce your problem to that of modeling token streams (for any arbitrary vocabulary of some set of discrete tokens), you can "throw an LLM at it".
It's not as helpful as Google was ten years ago. It's more helpful than Google today, because Google search has slowly been corrupted by garbage SEO and other LLM spam, including their own suggestions.
I’ve written two large applications and about a dozen smaller ones using Claude as an assistant.
I’m a terrible front-end developer and almost none of that work was possible without Claude. The API and AWS deployment were sped up tremendously.
I’ve created unit tests and I’ve read through the resulting code and it’s very clean. One of my core pre-prompt requirements has always been to follow domain-driven design principles, something a novice would never understand.
I also start with design principles and a checklist that Claude is excellent at providing.
My only complaint is you only have a 3-4 hour window before you’re cutoff for a few hours.
And needing an enterprise agreement to have a walled garden for proprietary purposes.
I was not a fan in Q1. Q2 improved. Q3 was a massive leap forward.
Deleted Comment
Dead Comment
IME, being forced to write about something, or to verbally explain/enumerate things in detail, _by itself_ leads to a lot of clarity in the writer's thoughts, irrespective of whether there's an LLM answering back.
People have been doing rubber-duck debugging for a long time. The metaphorical duck (the LLM, in our context), if explained to well, has now started answering back with useful stuff!
I see much deeper problems. Just to give two examples:
- I asked various AIs for explanations of proofs of some deep (established) mathematical theorems: the explanations were, as far as I could tell, badly hallucinated, and thus worse than "obviously wrong". I also asked for literature references for some deep mathematical theory frameworks: basically all of the references were again hallucinated.
- I asked lots of AIs on https://lmarena.ai/ to write a suitably long text about a political topic that is quite controversial in my country (one that does have lots of proponents, even in a very radical formulation, though most people would not use such a radical formulation in public). All of the LLMs that I checked refused, or tried to indoctrinate me that the thesis is wrong. I did not ask the LLM to lecture me; I gave it a concrete task! Society is deeply divided, so if an LLM only spreads the propaganda of its political training, it will be useless for many tasks for a very significant share of society.
Using a few messages to get them out of "I aim to be direct" AI assistant mode gets much better overall results for the rest of the chat.
Haiku is actually incredibly good at high level systems thinking. Somehow when they moved to a smaller model the "human-like" parts fell away but the logical parts remained at a similar level.
Like if you were taking meeting notes from a business strategy meeting and wanted insights, use Haiku over Sonnet, and thank me later.
If your model / chat app has the ability to always inject some kind of pre-prompt make sure to add something like “please do not jump to writing code. If this was a coding interview and you jumped to writing code without asking questions and clarifying requirements you’d fail”.
At the top of all your source files include a comment with the file name and path. If you have a project on one of these services, add an artifact that is the directory tree ("tree --gitignore" is my go-to). This helps "unaided" chats get a sense of what documents they are looking at.
And also, it’s a professional bullshitter so don’t trust it with large scale code changes that rely on some language / library feature you don’t have personal experience with. It can send you down a path where the entire assumption that something was possible turns out to be false.
Does it seem like a lot of work? Yes. Am I actually more productive with the tool than without? Probably. But it sure as shit isn't "free" in terms of time spent providing context. I think the more I use these models, the more I get a sense of what they're good at and what is going to be a waste of time.
Long story short, prompting is everything. These things aren’t mind readers (and worse they forget everything in each new session)
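A rough Python stand-in for that directory-tree artifact, for when `tree` isn't available (the function name and skip list are my own sketch, and it doesn't honor .gitignore):

```python
import os

def tree_listing(root, skip=("node_modules", ".git", "__pycache__")):
    """Plain-text directory tree -- a rough stand-in for `tree --gitignore`."""
    lines = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune noisy directories in place so os.walk skips them entirely
        dirnames[:] = sorted(d for d in dirnames if d not in skip)
        rel = os.path.relpath(dirpath, root)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        indent = "    " * depth
        lines.append(f"{indent}{os.path.basename(dirpath)}/")
        lines.extend(f"{indent}    {name}" for name in sorted(filenames))
    return "\n".join(lines)
```

Paste the output into the project artifact and keep it current as the layout changes.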
I could only ever really jam with 4o.
Makes me wonder if there's personal communication preferences at play here.
A very big surprise is just how much better Sonnet 3.5 is than Haiku. Even the confusingly-more-expensive-Haiku-variant Haiku 3.5 that's more recent than Sonnet 3.5 is still much worse.
I.e., over time this constitutes a fundamental shift in how we interact with abstractions in computers. The current fundamentals will remain, but they will become increasingly malleable. Details in code will become less important; architecture will become increasingly important. But at the same time, the cost of refactoring or changing architecture will quickly drop.
Any details that are easily lost when passing through an LLM will be details that have the highest maintenance cost. Any important details that can be retained by an LLM can move up and down the ladder of abstraction at will.
Can an LLM based solution maintain software architectures without introducing noise? The answer to that is the difference between somewhat useful and game changing.
All the tasks I can think of dealing with on my own computer that would take hours, a) are actually pretty interesting to me and b) would equally well take hours to "provide perfect guidance". The drudge work of programming that I notice comes in blocks of seconds at a time, and the mental context switch to using an LLM would be costlier.
The GP is claiming GPT4o is bad but Sonnet is good. GPT4o is about only 20% cheaper than Sonnet.
My son throwing an irrational tantrum at the amusement park and I can't figure out why he's like that (he won't tell me or he doesn't know himself either) or what I should do? I feed Claude all the facts of what happened that day and ask for advice. Even if I don't agree with the advice, at the very least the analysis helps me understand/hypothesize what's going on with him. Sure beats having to wait until Monday to call up professionals. And in my experience, those professionals don't do a better job of giving me advice than Claude does.
It's weekend, my wife is sick, the general practitioner is closed, the emergency weekend line has 35 people in the queue, and I want some quick half-assed medical guidance that while I know might not be 100% reliable, is still better than nothing for the next 2 hours? Feed all the symptoms and facts to Claude/ChatGPT and it does an okay job a lot of the time.
I've been visiting a Traditional Chinese Medicine (TCM) practitioner for a week now, and my symptoms are indeed receding. But the TCM paradigm and concepts are so different from western medicine's that I can't understand the doctor's explanations at all. Again, Claude does a reasonable job of explaining to me what's going on, or why it works, from a western medicine point of view.
Want to write a novel? Brainstorm ideas with GPT-4o.
I had a debate with a friend's child over the correct spelling of a Dutch word ("instabiel" vs "onstabiel"). Google results were not very clear. ChatGPT explained it clearly.
Just where is this "useless" idea coming from? Do people not have a life outside of coding?
It seems like you trust AI more than people and prefer it to direct human interaction. That seems to be satisfying a need for you that most people don't have.
You don’t understand how medicine works, at any level.
Yet you turn to a machine for advice, and take it at face value.
I say these things confidently, because I understand medicine well enough not to seek my own answers. Recently I went to a doctor for a serious condition, and every notion I had was wrong. Provably wrong!
I see the same behaviour in junior developers that simply copy-paste in whatever they see in StackOverflow or whatever they got out of ChatGPT with a terrible prompt, no context, and no understanding on their part of the suitability of the answer.
This is why I and many others still consider AIs mostly useless. The human in the loop is still the critical element. Replace the human with someone that thinks that powdered rhino horn will give them erections, and the utility of the AI drops to near zero. Worse, it can multiply bad tendencies and bad ideas.
I’m sure someone somewhere is asking DeepSeek how best to get endangered animals parts on the black market.
If you aren't a coder, it's hard to find much utility in "Google, but it burns a tree whenever you make an API call, and everything it tells you might be wrong". I for one have never used it for anything else. It just hasn't ever come up.
It's great at cheating on homework; kids love GPTs. It's great at cheating in general, in interviews for instance. Or at ruining Christmas: after this year's LLM debacle, it's unclear whether we'll have another edition of Advent of Code. None of this is the technology's fault, of course; you could say the same about the Internet or phones. But it's hardly a point in its favor either.
And if you are a coder, models like Claude actually do help you, but you have to monitor their output and thoroughly test whatever comes out of them, a far cry from the promises of complete automation and insane productivity gains.
If you are only a consumer of this technology, like the vast majority of us here, there isn't that much of an upside in being an early adopter. I'll sit and wait, slowly integrating new technology in my workflow if and when it makes sense to do so.
Happy new year, I guess.
Other than, y'know, using the new tools. As a programmer-heavy forum, we focus a lot on LLMs' (lack of) correctness. There's more than a little annoyance when things are wrong: like being asked to grab the red blanket, then getting into an argument over whether it's actually orange, instead of focusing on what mattered, that someone needed the blanket because they were cold.
Most of the non-tech people who use ChatGPT that I've talked to absolutely love it, because they don't feel it judges them for asking stupid questions, and they have conversations with it about absolutely everything in their lives, down to which outfit to wear to the party. There are wrong answers to that question as well, but they're far more subjective, and just having another opinion in the room is invaluable. It's just a computer and won't get hurt if you totally ignore its recommendations, and even better, it won't gloat (unless you ask it to) if you tell it later that it was right and you were wrong.
Some people have found upsides for themselves in their lives, even at this nascent stage. No one's forcing you to use one, but your job isn't going to be taken by AI, it's going to be taken by someone else who can outperform you that's using AI.
But not at exploring what is at the border of knowledge itself. And by converging on the conventional, LLMs actually lead you away from anything that would actually extend it.
> doing boring tasks for which you can provide perfect guidance
That's true, but you never needed an LLM for that. There are wonderful scripts written by wonderful people, provided for free, for those who search in the right places. LLM companies benefit/profit off these without providing anything in return.
They are worse than people who grab FOSS and turn it into overpriced and aggressively marketed business models and services or people who threaten and sue FOSS for being better and free alternatives to their bloated and often "illegally telemetric" services.
> able to accelerate you
True, but you leave too much for data brokers and companies like Meta to abuse and exploit in the future. All that additional "interactional data" will do so much worse to humanity than all those previous data sets did in elections, for example, or pretty much all consumer markets. They will mostly accelerate all these dimwitted Fortune 5000 companies that have sabotaged consumers into way too much dumb shit - way more than is reasonable or "ok". And educated, wealthy and or tech-savvy people won't be able to avoid/evade any of that. Especially when it's paired with meds, drugs, foods, biases, fallacies, priming and so on and all the knowledge we will gain on bio-chemical pathways and human liability to sabotage.
They are great for coders, of course, everyone can be an army of clone-warriors with auto-complete on steroids now and nobody can tell you what to do with all that time that you now have and all that money, which, thanks to all of us but mostly our ancestors, is the default. The problem is the resulting hyper-amplified, augmented financial imbalance. It's gonna fuck our species if all the technical people don't restore some of that balance, and everybody knows what that means and what must be done.
I haven’t found anything comparably good for JetBrains IDEs yet, but I’m also not switching to something else as my main editor.
Each task / programming language / query requires trying different LLM models and novel ways of prompting. If it's not work-related (or work pays for the one you use) sending as much of the code as relevant also helps the answers be more useful.
Most of the people I meet that say LLMs are not useful have only tried one (flavor / plugin), do not know how to pre-prompt or prompt, and do not give the tools a chance. Try one or two things, say yep, it's not good and give up.
Still hard for me to admit that Prompt Engineering is a profession, but it's the same as Google Fu. Once you learn it you can become an LLM Ninja!
I do not believe LLMs are coming for my job (just yet) but do believe they are going to be able to replace some people, are useful and those that do not use them will be at a disadvantage.
Deleted Comment
Deleted Comment
Dead Comment
https://www2.math.upenn.edu/~ghrist/preprints/LAEF.pdf - this math textbook was written in just 55 days!
Paraphrasing the acknowledgements -
...Begun November 4, 2024, published December 28, 2024.
...assisted by Claude 3.5 sonnet, trained on my previous books...
...puzzles co-created by the author and Claude
...GPT-4o and -o1 were useful in latex configurations...doing proof-reading.
...Gemini Experimental 1206 was an especially good proof-reader
...Exercises were generated with the help of Claude and may have errors.
...project was impossible without the creative labors of Claude
The obvious comparison is to the classic Strang https://math.mit.edu/~gs/everyone/ which took several *years* to conceptualize, write, peer review, revise and publish.
Ok maybe Strang isn't your cup of tea, :%s/Strang/Halmos/g , :%s/Strang/Lipschutz/g, :%s/Strang/Hefferon/g, :%s/Strang/Larson/g ...
Working through the exercises in this new LLMbook, I'm thinking...maybe this isn't going to stand the test of time. Maybe acceleration is not so hot after all.
Maybe I'm not the target audience, but... that really doesn't make me interested in continuing to read.
Great on the surface, but lacking any depth, cohesion, or substance.
Then I'd have Claude create text. I'd then edit/refine each chapter's text.
Wow, was it unpleasant. It was kinda cool to see all the words put together, but editing the output was a slog.
It's bad enough editing your own writing, but for some reason this was even worse.
The date/time that divides my world into before/after is AlphaGo v. Lee Sedol, game 3 (2016). From that time forward, I don't dismiss out of hand speculations about how soon we can have intelligent machines. Ray Kurzweil's date of 2045 is as good as any (and better than most) for an estimate. Like Moore's (and related) Laws, it's not about how, but about the historical pace of advancements crossing a fairly static point of human capability.
Application coding requires much less intelligence than playing Go at these high levels. The main differences are concise representation and clear final-outcome scoring. LLMs deal quite well with the fuzziness of human communications. There may be a few more pegs to place, but when is predictably unknown.
I wish the author qualified this more. How does one develop that skill?
What makes LLMs so powerful on a day to day basis without a large RAG system around it?
Personally, I try LLMs every now and then, but haven’t seen any indication of their usefulness for my day to day outside of being a smarter auto complete.
In my experience, LLM tools are the same, you ask for something basic initially and then iteratively refine the query either via dialog or a new prompt until you get what you are looking for or hit the end of the LLM's capability. Knowing when you've reached the latter is critically important.
* Most existing LLM interfaces are very bad at editing history, instead focusing entirely on appending to history. You can sort of ignore this for one-shot, and this can be properly fixed with additional custom tools, but ...
* By the time you refine your input enough to patch over all the errors in the LLM's output for your sensible input, you're bigger than the LLM can actually handle (much smaller than the alleged context window), so it starts randomly ignoring significant chunks of what you wrote (unlike context-window problems, the ignored parts can be anywhere in the input).
I'd love to figure this out. I've written more about them than most people at this point, and my goal has always been to help people learn what they can and cannot do - but distilling that down to a concise set of lessons continues to defeat me.
The only way to really get to grips with them is to use them, a lot. You need to try things that fail, and other things that work, and build up an intuition about their strengths and weaknesses.
The problem with intuition is it's really hard to download that into someone else's head.
I share a ton of chat conversations to show how I use them - https://simonwillison.net/tags/tools/ and https://simonwillison.net/tags/ai-assisted-programming/ have a bunch of links to my exported Claude transcripts.
My first stab at trying ChatGPT last year was asking it to write some Rust code to do audio processing. It was not a happy experience. I stepped back and didn't play with LLMs at all for a while after that. Reading your posts has helped me keep tabs on the state of the art and decide to jump back in (though with different/easier problems this time).
Deleted Comment
You're misrepresenting it here.
The point of that post isn't "look at these incredible projects I've built (proceeds to show simple projects)."
It's "I built 14 small and useful tools in a single week, each taking between 2 and 10 minutes".
The thing that's interesting here is that I can have an LLM kick out a working prototype of a small, useful tool in only a little more time than it takes to run a Google search.
That post isn't meant to be about writing "real production code". I don't know why people are confused over that.
Instead, think of an LLM as the equivalent of giving a human a menial task. You know that they're not 100% reliable, and so you give them only tasks that you can quickly verify and correct.
Abstract that out a bit further, and realize that most managers don't expect their reports to be 100% reliable.
Don't use LLMs where accuracy is paramount. Use it to automate away tedious stuff. Examples for me:
Cleaning up speech recognition. I use a traditional voice recognition tool to transcribe, and then have GPT clean it up. I've tried voice recognition tools for dictation on and off for over a decade, and always gave up because even a 95% accuracy is a pain to clean up. But now, I route the output to GPT automatically. It still has issues, but I now often go paragraphs before I have to correct anything. For personal notes, I mostly don't even bother checking its accuracy - I do it only when dictating things others will look at.
And then add embellishments to that. I was dictating a recipe I needed to send to someone. I told GPT up front to write any number that appears next to an ingredient as a numeral (e.g., 3 instead of "three"). Did a great job; didn't need to correct anything.
And then there are always the "I could do this myself but I didn't have time so I gave it to GPT" category. I was giving a presentation that involved graphs (nodes, edges, etc). I was on a tight deadline and didn't want to figure out how to draw graphs. So I made a tabular representation of my graph, gave it to GPT, and asked it to write graphviz code to make that graph. It did it perfectly (correct nodes and edges, too!)
Sure, if I had time, I'd go learn graphviz myself. But I wouldn't have. The chances I'll need graphviz again in the next few years is virtually 0.
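The tabular-graph-to-Graphviz task is small enough to sketch; something along these lines is the kind of code GPT would plausibly hand back (the function and names are illustrative, not the commenter's actual output):

```python
def edges_to_dot(name, edges):
    """Render a list of (src, dst) pairs as Graphviz DOT source."""
    lines = [f"digraph {name} {{"]
    # One edge statement per (src, dst) row of the table
    lines += [f'    "{a}" -> "{b}";' for a, b in edges]
    lines.append("}")
    return "\n".join(lines)

print(edges_to_dot("deps", [("compiler", "parser"), ("parser", "lexer")]))
```

Feed the result to `dot -Tpng` and the diagram is done, which is exactly the "easy to verify" shape of task LLMs handle well.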
I've actually used LLMs to do quick reformatting of data a few times. You just have to be careful that you can verify the output quickly. If it's a long table, then don't use LLMs for this.
Another example: I have a custom note taking tool. It's just for me. For convenience, I also made an HTML export. Wouldn't it be great if it automatically made alt text for each image I have in my notes? I would just need to send it to the LLM and get the text. It's fractions of a cent per image! The current services are a lot more accurate at image recognition than I need them to be for this purpose!
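A sketch of what that alt-text hook could look like, assuming an OpenAI-style chat payload (the model name, prompt text, and function are my assumptions; the actual HTTP call is omitted):

```python
import base64

def alt_text_request(image_bytes, model="gpt-4o-mini"):
    """Build an OpenAI-style chat payload asking for one-sentence alt text."""
    data_url = "data:image/png;base64," + base64.b64encode(image_bytes).decode()
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Write one sentence of alt text for this image."},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
    }
```

The HTML exporter would call this per image and drop the reply into the `alt` attribute.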
Oh, and then of course, having it write Bash scripts and CSS for me :-) (not a frontend developer - I've learned CSS in the past, but it's quicker to verify whatever it throws at me than Google it).
Any time you have a task and lament "Oh, this is likely easy, but I just don't have the time" consider how you could make an LLM do it.
Then why do people keep pushing it for code related tasks?
Accuracy and precision is paramount with code. It needs to express exactly what needs to be done and how.
The problem is: for the tasks that I can give the LLM (or human) that I can easily verify and correct, the LLM fails with the majority of them, for example
- programming tasks in my area of expertise (which is more "mathematical" than what is common in SV startups), where I know what a high-level solution has to look like, and where I can ask the LLM to explain the gory details to me. Yes, these gory details are subtle (which is why the task can be menial), but the code has to be right. I can verify this, and the code is not correct.
- getting literature references about more obscure scientific (in particular mathematical) topics. I can easily check whether these literature references (or summaries of these references) are hallucinations - they typically are.
My programmer mind tells me that "tedious stuff" is where accuracy is the most important.
The best prompts though are always written in a separate text file for me and pasted in. Follow up questions are never as good as a detailed initial prompt.
I would imagine that formulating questions well for the problem at hand is a skill, but beyond that I don't think there is anything special about how to ask LLMs a question.
In areas the LLM is rather useless, no amount of variation in prompting can solve that problem IMO. Just like if the tasks is something the LLM is good at, the prompt can be pretty sloppy and seem like magic with how it can understand what you want.
The tricky problem with LLMs is identifying failures - if you're asking the question, it's implied that you don't have enough context to assess whether it's a hallucination or a good recommendation! One approach is to build ensembles of agents that can check each other's work, but that's a resource-intensive solution.
Let people work how they want. I wouldn’t not hire someone on the basis of them not using a language server.
The creator of the Odin language famously doesn’t use one. He’s says that he, specifically, is faster without one.
Recently, I shared a code base with a junior dev and she was surprised with the speed and sophistication of the code. The LLM did 80+% of the "coding".
What was telling was as she was grokking the code (for helping the ~20%), she was surprised at the quality of the code - her use of the LLM did not yield code of similar quality.
I find that the more domain awareness one brings to the table, the better the output is. Basically the clearer one's vision of the end-state, the better the output.
One other positive side-effect of using "LLMs as a junior-dev" for me has been that my ambitions are greater. I want it all - better code, more sophisticated capabilities even for relatively not-important projects, documentation, tests, debug-ability. And once the basic structure is in place, many a time it is trivial to get the rest.
It's never 100%, but even with 80+%, I am faster than ever before, deliver better quality code, and can switch domains multiple times a week and never feel drained.
Sharing best AI hacks within a team will have the same effect as code-reviews do in ensuring consistency. Perhaps an "LLM chat review", especially when something particularly novel was accomplished!
Once that's all done, you basically have a well-structured question you could pass to an underling and have them completely independently work on the project without bugging you. That's the goal. Now, pass that to o1 or Claude, depending on whether it's a general-purpose task (o1) or a code-specific task (Claude), and wait for response. From there, have a conversation or test-and-followup of whatever it spits out, this time with you asking questions. If good enough, done. If not, wrap up whatever useful insights from that line of questioning and put it back into the initial prompt and either re-post it at the end of the conversation or start a fresh conversation.
I find 90% of the time this gets exactly what I'm after eventually. The few other cases are usually because we hit some cycle where the AI doesn't fully know what to change/respond, and it keeps repeating itself when I ask. The trick then is to ask things a different way or emphasize something new. This is usually just a code-specific issue, for general problems it's much better. One other trick is to ask it to take a step back and just tackle the problem in a theoretical/philosophical way first before trying to do any coding or practical solving, and then do that in a second phase (asking o1 to architect code structure and then Claude to implement it is a great combo too). Also if there is any way to break up the problem into smaller pieces which can be tackled one conversation at a time - much better. Just remember to include all relevant context it needs to interface with the overall problem too.
That sounds like a lot, but it's essentially just project management and delegation to somewhat-flawed underlings. The upside is instead of waiting a workweek for them to get back to you, you just have to wait 20 seconds. But it does mean a ton of reading and writing. There are certainly already some meta-prompts where you can get the AI to essentially do this whole process for you and assess itself, but like all automation that means extra ways for things to break too. Let the AI devs cook though and those will be a lot more commonplace soon enough...
[Edit: o1 mostly agrees lol. Some good additional suggestions for systematizing this: https://chatgpt.com/share/6775b85c-97c4-8003-bd31-ee288396ab... ]
Dead Comment
>LLM prices crashed
This one has me a little spooked. The white knight on this front (DS) has both announced price increases and had staff poached. There is still the Gemini free tier, which is of course basically impossible to beat (solid and functionally unlimited/free), but it's Google, so I'm reluctant to trust it.
Seriously worried about seeing a regression on pricing in first half of 2025. Especially with the OAI $200 price anchoring.
>“Agents” still haven’t really happened yet
Think that's largely because it's a poorly defined concept and true "agent" implies some sort of pseudo-agi autonomy. This is a definition/expectation issue rather than technical in my mind
>LLMs somehow got even harder to use
I don't think that's 100% true. An explosion of options is not the same as harder to use. And the guidance for noobs is still pretty much the same as always (llama.cpp or one of the common frontends like text-generation-webui). It's become harder to tell what is good, but not harder to get going.
----
One key theme I think is missing is just how hard it has become to tell what is "good" for the average user. There is so much benchmark shenanigans going on that it's just impossible to tell. I'm literally at the "I'm just going to build my own testing framework" stage. Not because I can do better technically (I can't)...but because I can gear it towards things I care about and I can be confident my DIY sample hasn't been gamed.
These companies are incentivized to figure out fast and efficient hosting for the models. They don't need to train any models themselves, their value is added entirely in continuing to drive the price of inference down.
Groq and Cerebras are particularly interesting here because WOW, they serve Llama fast.
Is it free free? The last time I checked there was a daily request limit, still generous but limiting for some use cases. Isn't it still the case?
That's an indication that most business-sized models won't need some giant data center. This is going to be a cheap technology most of the time. OpenAI is thus way overvalued.
This means that the definitions of "laptop" and "server" are dependent on use. We should instead talk about RAM, GPU, and CPU speeds, which is more useful and informative, but less engaging than "my laptop".
However, it has been clear for a long time that Meta is just demolishing any competitor's moats, driving the whole megacorp AI competition to razor-thin margins.
It's a very welcome strategy from a consumer POV but, it has to be said, genius from a business POV. By deciding that no one will win, Meta can prevent anyone leapfrogging it, at a relatively cheap price.
That is, of course, assuming AGI is possible and exponential, and that market share goes to a single entity rather than a set of entities. Lots of big assumptions. Seems like we're heading toward a slow, lackluster singularity though.
People are buying shares at $x because they believe they will be able to sell them for more later. I don't think there's a whole lot more to it than that.
OpenAI predicts more revenue from ChatGPT than api access through 2029.
It's the old Netflix / HBO trope of which can become the other first: HBO figuring out streaming, or Netflix figuring out original programming.
I bet Google will figure this out and thus OpenAI won’t disrupt as much as people think it will.
The non-skeptical interpretation is that it's a threshold function, a flat-out race with an unambiguous finish line. If someone actually hit self-improving AGI first there's an argument that no one would ever catch up.
They run on a laptop, yes: you might squeeze up to 10 tokens/sec out of a kinda-sorta GPT-4, if you paid $5K+ for an Apple laptop in the last 18 months.
And that's after you spent 2 minutes watching a 1000-token* prompt prefill at 10 tokens/sec.
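The "2 minutes" figure roughly checks out under the stated assumptions (1000-token prompt, ~10 tokens/sec prefill):

```python
prompt_tokens = 1000   # the footnoted prompt size
prefill_rate = 10      # tokens/sec, as quoted above
seconds = prompt_tokens / prefill_rate
print(seconds, seconds / 60)  # 100.0 seconds, i.e. ~1.7 minutes
```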
Usually it'd be obvious that this would trickle down; things always do, right?
But... Apple infamously was stuck at 8 GB of RAM in even $1,500 base models for years. I have 0 idea why, but my intuition is that RAM capacity was roughly doubling at the same cost every 3 years until the early 2010s, then mostly stalled out post-2015.
And regardless of any of the above, this absolutely melts your battery. Like, your 16 hr battery life becomes 40 minutes, no exaggeration.
I don't know why prefill (loading in your prompt) is so slow for local LLMs, but it is. I assume if you have a bunch of servers there's some caching you can do that works across all prompts.
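The usual answer is prompt-prefix (KV) caching: a server handling many requests that share a long prefix (the same system prompt, or a growing chat transcript) only has to prefill the new suffix. A toy sketch of the idea, with string lengths standing in for real KV tensors:

```python
# Toy model of prefix caching: track which prompts already have a
# computed state, and charge prefill only for the uncached suffix.
cache = {""}  # prompts whose state is already computed; "" is free

def prefill_cost(prompt):
    # Longest already-cached prefix of this prompt.
    best = max((p for p in cache if prompt.startswith(p)), key=len)
    cache.add(prompt)
    return len(prompt) - len(best)  # characters that actually need work

turn1 = "SYSTEM: be terse. USER: hi. "
turn2 = turn1 + "ASSISTANT: hello. USER: now write a sort function. "
c1 = prefill_cost(turn1)  # full cost: nothing cached yet
c2 = prefill_cost(turn2)  # only the newly appended turn is processed
print(c1, c2)
```

A local single-user runtime can reuse its own KV cache across turns of one conversation, but a server fleet can share cached prefixes across many users, which a laptop never gets to amortize.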
I expect the local LLM community to be roughly the same size it is today 5 years from now.
* ~3 pages / ~750 words; what I expect is a conservative average for prompt size when coding
Most web servers can run some number of QPS on a developer laptop, but AWS is a big business, because there are a heck of a lot of QPS across all the servers.
Consumer GPUs top out at 24 GB VRAM.
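Which constrains what fits. A rough weights-only estimate (ignoring KV cache and activations; the bytes-per-parameter figures for quantized formats are approximate):

```python
def weights_gb(params_billions, bytes_per_param):
    # params (billions) * bytes/param ~= GB of weights
    return params_billions * bytes_per_param

print(weights_gb(70, 2.0))   # 140.0 GB: a 70B model at fp16
print(weights_gb(70, 0.5))   # 35.0 GB: 70B at ~4-bit, still over 24 GB
print(weights_gb(13, 2.0))   # 26.0 GB: 13B at fp16 just misses
print(weights_gb(13, 0.5))   # 6.5 GB: 13B at ~4-bit fits comfortably
```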
Then, several headings later:
> I have it on good authority that neither Google Gemini nor Amazon Nova (two of the least expensive model providers) are running prompts at a loss.
So...which is it?
They're not running at a loss. I'll fix that.
This 100%. “Agentic” especially as a buzzword can piss off
My problem is when people use that definition (or any other) without clarifying, because they assume it's THE obvious definition.
The money is still flowing, for now, to subsidize that fiasco, but as soon as that starts to slow, even just a bit, things are gonna get bumpy real quick. Super excited about this tech, but there are dark storm clouds building on the horizon, and absent a major "moat" breakthrough it's gonna get rough soon.
That's exactly what happened with rideshare companies. It was an amazing new thing, but subsidized in an unsustainable way; then a bunch of companies exited the space when it became a commoditized race to the bottom, and those left let quality slip. Now when you order an Uber, a car shows up that smells bad and has wheels about to fall off. The consumer experience was a lot better when Uber was a VC-subsidized bonanza.