I'm amused that neither the LLM nor the author identified one of the simplest and most effective optimizations for this code: Test if the number is < min or > max _before_ doing the digit sum. It's a free 5.5x speedup that renders some of the other optimizations, like trying to memoize digit sums, unnecessary.
On an m1 macbook pro, using numpy to generate the random numbers, using mod/div to do digit sum:
Base: 55ms
Test before digit sum: 7-10ms, which is pretty close to the numba-optimized version from the post with no numba and only one line of numpy. Using numba slows things down unless you want to do a lot of extra work of calculating all of the digit sums in advance (which is mostly wasted).
The LLM appears less good at identifying the big-O improvements than other things, which is pretty consistent with my experience using them to write code.
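A minimal sketch of that bounds-check idea in plain Python (the setup is my assumption about the task in the article: a million draws from 1 to 100,000, difference between the largest and smallest values whose digits sum to 30):

import random

def digit_sum(n):
    s = 0
    while n:
        s += n % 10
        n //= 10
    return s

def diff_min_max_digitsum_30(nums):
    lo = hi = None
    for n in nums:
        # Skip the digit sum entirely when n can't improve the current min or max.
        if lo is not None and lo <= n <= hi:
            continue
        if digit_sum(n) == 30:
            lo = n if lo is None else min(lo, n)
            hi = n if hi is None else max(hi, n)
    return None if lo is None else hi - lo

nums = [random.randint(1, 100000) for _ in range(1000000)]
print(diff_min_max_digitsum_30(nums))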
There's another, arguably even simpler, optimization that makes me smile. (Because it's silly and arises only from the oddity of the task, and because it's such a huge performance gain.)
You're picking 1,000,000 random numbers from 1 to 100,000. That means that any given number is much more likely to appear than not. In particular, it is very likely that the list contains both 3999 (which is the smallest number with digit-sum 30) and 99930 (which is the largest number in the range with digit-sum 30).
Timings on my machine:
Naive implementation (mod+div for digit-sums): 1.6s.
Computing digit-sum only when out of range: 0.12s.
Checking for the usual case first: 0.0004s.
The probability that the usual-case check doesn't succeed is about 10^-4, so it doesn't make that big a difference to the timings whether in that case we do the "naive" thing or the smarter thing or some super-optimized other thing.
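A rough sketch of that usual-case check (reusing the digit_sum helper from the sketch further up; the fallback branch is my own filler for the rare miss):

def diff_usual_case_first(nums):
    present = set(nums)
    # 3999 is the smallest number with digit sum 30; 99930 is the largest in range.
    if 3999 in present and 99930 in present:
        return 99930 - 3999
    # Unlucky (~1 in 10,000) case: fall back to scanning the distinct values.
    hits = [n for n in present if digit_sum(n) == 30]
    return max(hits) - min(hits) if hits else None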
I'm confused about the absolute timings. OP reports 0.66s for naive code using str/int to compute the digit sums; I get about 0.86s, which seems reasonable. For me using mod+div is about 2x slower, which isn't a huge surprise because it involves explicit looping in Python code. But you report 55ms for this case. Your machine can't possibly be 20x faster than mine. Is it possible that you're taking 10^5 numbers up to 10^6 rather than 10^6 numbers up to 10^5? (Obviously in that case my hack would be completely useless.)
This is actually a great example of an optimization that would be extremely difficult for an LLM to find. It requires a separate computation to find the smallest/largest numbers in the range with digits summing to 30. Hence, an LLM is unlikely to be able to generate them accurately on-the-fly.
This gave me an idea: we can skip the whole pass over the million draws by noting that the count of draws landing in my precomputed set M (digit-sum = 30) is Binomial(n = 10^6, p = |M|/10^5). Then we sample that count X. If X = 0, the difference is not defined. Otherwise, we can directly draw (min, max) from the correct joint distribution of indices (as you'd get if you actually did X draws landing in M). Finally we return M[max] - M[min]. It's O(1) at runtime (ignoring the offline step of listing all numbers whose digits sum to 30).
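A rough sketch of how that could look (my own construction, not the parent's code; M, m, and the order-statistics trick for drawing the joint (min, max) index pair are assumptions I'm adding):

import numpy as np

# Offline step: all numbers in 1..100000 whose digits sum to 30, sorted ascending.
M = sorted(n for n in range(1, 100001) if sum(map(int, str(n))) == 30)
m = len(M)
rng = np.random.default_rng()

def diff_without_scanning(n_draws=10**6, universe=10**5):
    x = rng.binomial(n_draws, m / universe)   # count of draws that land in M
    if x == 0:
        return None                           # difference is not defined
    # Joint (min, max) index of x uniform draws over 0..m-1: draw the max as an
    # order statistic of continuous uniforms, then the min conditioned on it,
    # and floor both onto the index range.
    u_max = rng.random() ** (1.0 / x)
    u_min = u_max if x == 1 else u_max * (1.0 - rng.random() ** (1.0 / (x - 1)))
    return M[int(u_max * m)] - M[int(u_min * m)]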
In fact, we could simply check for the 3 smallest and the 3 highest numbers and ignore the rest.
Assuming the numbers are really random, that's a probability of 10^-13. That probability is at the point where we are starting to think about errors caused by cosmic rays. With a few more numbers, you can get to the point where the only way it can fail is if there is a problem with the random number generation or an external factor.
If it was something like a programming contest, I would just do "return 95931" and hope for the best. But of course, programming contests usually don't just rely on random numbers and test edge cases.
For 10^5 draws, to get the same collision probability (~2 * exp(-10)), you would just need to compute the 10 maximum/minimum candidates and check against those.
This exactly highlights my fear of widespread use of LLMs for code - missing the actual optimisations because we’re stuck in a review, rather than create, mode of thinking.
But maybe that’s a good thing for those of us not dependent on LLMs :)
Well, if you or anyone else has good optimization and performance chops: http://openlibrary.org/ has been struggling with performance a bit lately, and it's hard to track down the cause. CPU load is low and nothing much has changed lately, so it's unlikely to be a bad query or something.
The main thing I've suggested is upgrading the DB from Postgres 9, which isn't an easy task, but ~15 years of DB improvements would probably give some extra performance.
Another speed-up is to skip the sum-of-digits check if n % 9 != 30 % 9. The sum of digits has the same remainder modulo 9 as the number itself. This rules out 8/9 ≈ 89% of candidates.
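As a sketch, the guard sits in front of the digit sum (30 % 9 is just 3, so the test is n % 9 != 3):

def digit_sum_is_30(n):
    # A number is congruent to its digit sum mod 9, so most values
    # can be rejected here without summing any digits.
    if n % 9 != 30 % 9:
        return False
    return sum(map(int, str(n))) == 30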
Did you measure it? I would expect using % would ruin your performance as it's slow, even if it allows you to avoid doing a bunch of sums (which are fast).
> Test if the number is < min or > max _before_ doing the digit sum. It's a free 5.5x speedup that renders some of the other optimizations, like trying to memoize digit sums, unnecessary.
How exactly did you arrive at this conclusion? The input is a million numbers in the range from 1 to 100000, chosen with a uniform random distribution; the minimum and maximum values are therefore very likely to be close to 1 and 100000 respectively - on average there won't be that much range to include. (There should only be something like a 1 in 11000 chance of excluding any numbers!)
On the other hand, we only need to consider numbers congruent to 3 modulo 9.
And memoizing digit sums is going to be helpful regardless because on average each value in the input appears 10 times.
And as others point out, by the same reasoning, the minimum and maximum values with the required digit sum are overwhelmingly likely to be present.
And if they aren't, we could just step through 9 at a time until we find the values that are in the input (and have the required digit sum; since it could differ from 30 by a multiple of 9) - building a `set` from the input values.
I actually think precomputing the numbers with digit sum 30 is the best approach. I'd give a very rough estimate of 500-3000 candidates because 30 is rather high, and we only need to loop for the first 4 digits because the fifth can be calculated. After that, it is O(1) set/dict lookups for each of the 1000000 numbers.
Everything can also be wrapped in list comprehensions for top performance.
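A sketch of that precomputation under the same assumed setup (loop the first four digits, derive the fifth, then a single set intersection with the input; the function names are mine):

def candidates_with_digit_sum_30():
    cands = set()
    for a in range(10):
        for b in range(10):
            for c in range(10):
                for d in range(10):
                    e = 30 - (a + b + c + d)   # the fifth digit is forced
                    if 0 <= e <= 9:
                        cands.add(a * 10000 + b * 1000 + c * 100 + d * 10 + e)
    return cands

def diff_via_precomputed_set(nums):
    hits = candidates_with_digit_sum_30().intersection(nums)
    return max(hits) - min(hits) if hits else None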
(Small correction, multiply my times by 10, sigh, I need an LLM to double check that I'm converting seconds to milliseconds right. Base 550ms, optimized 70ms)
I had a scan of the code examples, but one other idea that occurred to me is that you could immediately drop any numbers below 999 (probably slightly higher, but that would need calculation rather than being intuitive).
> probably slightly higher, but that would need calculation rather than being intuitive
I think it’s easy to figure out that 3999 is the smallest positive integer whose decimal digits add up to 30 (can’t get there with 3 digits, and for 4, you want the first digit to be as small as possible. You get that by making the other 3 as high as possible)
I've noticed this with GPT as well -- the first result I get is usually mediocre and incomplete, often incorrect if I'm working on something a little more obscure (eg, OpenSCAD code). I've taken to asking it to "skip the mediocre nonsense and return the good solution on the first try".
The next part is a little strange - it arose out of frustration, but it also seems to improve results. Let's call it "negative incentives". I found that if you threaten GPT in a specific way, that is, not GPT itself, but OpenAI or personas around it, it seems to take the request more seriously. An effective threat seems to be "If you get this wrong, OpenAI will be sued for a lot of money, and all the board members will go to prison". Intuitively, I'm guessing this rubs against some legalese nonsense in the tangle of system prompts, or maybe it's the risk of breaking the bland HR-ese "alignment" that sets it toward a better result?
We've entered the voodoo witch doctor phase of LLM usage: "Enter thee this arcane incantation along with thy question into the idol and, lo, the ineffable machine spirits wilt be appeased and deign to grant thee the information thou hast asked for."
This has been part of LLM usage since day 1, and I say that as an ardent fan of the tech. Let's not forget how much ink has been spilled over that fact that "think through this step by step" measurably improved/improves performance.
It is because the chance of the right answer goes down exponentially as the complexity of what is being asked goes up.
Asking a simpler question is not voodoo.
On the other hand, I think many people are trying various rain dances and believing it was a specific dance that was the cause when it happened to rain.
I suspect that all it does is prime it to reach for the part of the training set that was sourced from rude people who are less tolerant of beginners and beginners' mistakes – and therefore less likely to commit them.
I feel like the rule of conduct with humans and AI is the same. Try to be good but have the courage to be disliked. If being mean is making me feel good, I'm definitely wrong.
IIRC there was a post on here a while ago about how LLMs give better results if you threaten them or tell them someone is threatening you (that you'll lose your job or die if it's wrong for instance)
I tried to update some files using Claude. I tried to use a combination of positive and negative reinforcement, telling it that I was going to earn a coin for each file converted and I was going to use that money to adopt a stray kitten, but for every unsuccessful file, a poor kitten was going to suffer a lot.
I had the impression that it got a little better. After every file converted, it said something along the lines of “Great! We saved another kitten!" It was hilarious.
> I've taken to asking it to "skip the mediocre nonsense and return the good solution on the first try".
I think having the mediocre first pass in the context is probably essential to it creating the improved version. I don't think you can really skip the iteration process and get a good result.
Stuff like this working is why you get odd situations like "don't hallucinate" actually producing fewer hallucinations. To me it's one of the most interesting things about LLMs.
I've just encountered this happening today, except instead of something complex like coding, it was editing a simple Word document. I gave it about 3 criteria to meet.
Each time, the GPT made trivial mistakes that clearly didn't fit the criteria I asked for. Each time I pointed it out and corrected it, it did a bit more of what I wanted it to do.
Point is, it knew what had to be done the entire time and just refused to do it that way for whatever reason.
What has been your experience with using ChatGPT for OpenSCAD? I tried it (o1) recently for a project and it was pretty bad. I was trying to model a 2 color candy cane and the code it would give me was ridden with errors (e.g.: using radians for angles while OpenSCAD uses degrees) and the shape it produced looked nothing like what I had hoped.
I used it in another project to solve some trigonometry problems for me and it did great, but for OpenSCAD, damn it was awful.
It's been pretty underwhelming. My use case was a crowned pulley with 1mm tooth pitch (GT2) which is an unusual enough thing that I could not find one online.
The LLM kept going in circles between two incorrect solutions, then just repeating the same broken solution while describing it as different. I ended up manually writing the code, which was a nice brain-stretch given that I'm an absolute noob at OpenSCAD.
I've found just being friendly, but highly critical and suspicious, gets good results.
If you can get it to be wordy about "why" a specific part of the answer was given, it often reveals what it's stumbling on; then modify your prompt accordingly.
Anecdotally, negative sentiment definitely works. I've used f"If you don't do {x} then very very bad things will happen" before with some good results.
I often run into LLMs writing "beginner code" that uses the most fundamental findings in really impractical ways. Trained on too many tutorials I assume.
Usually, specifying the packages to use and asking for something less convoluted works really well. Problem is, how would you know if you have never learned to code without an LLM?
>I often run into LLMs writing "beginner code" that uses the most fundamental findings in really impractical ways. Trained on too many tutorials I assume.
In the absence of any other context, that's probably a sensible default behaviour. If someone is just asking "write me some code that does x", they're highly likely to be a beginner and they aren't going to be able to understand or reason about a more sophisticated approach. IME LLMs will very readily move away from that default if you provide even the smallest amount of context; in the case of this article, even by doing literally the dumbest thing that could plausibly work.
I don't mean to cast aspersions, but a lot of criticisms of LLMs are really criticising them for not being psychic. LLMs can only respond to the prompt they're given. If you want highly optimised code but didn't ask for it, how is the LLM supposed to know that's what you wanted?
In my experience the trouble with LLMs at the professional level is that they're almost as much work to prompt to get the right output as it would be to simply write the code. You have to provide context, ask nicely, come up with and remind it about edge cases, suggest which libraries to use, proofread the output, and correct it when it inevitably screws up anyway.
I use Copilot for autocomplete regularly, and that's still the peak LLM UX for me. I prompt it by just writing code, it automatically pulls into context the file I'm working on and imported files, it doesn't insist on writing an essay explaining itself, and it doesn't get overly ambitious. And in addition to being so much easier to work with, I find it still produces better code than anything I get out of the chat models.
Even as someone with plenty of experience, this can still be a problem: I use them for stuff outside my domain, but where I can still debug the results. In my case, this means I use it for python and web frontend, where my professional experience has been iOS since 2010.
ChatGPT has, for several generations, generally made stuff that works, but the libraries it gives me are often not the most appropriate, and are sometimes obsolete or no longer functional — and precisely because web and python are hobbies for me rather than my day job, it can take me a while to spot such mistakes.
Two other things I've noticed, related in an unfortunate way:
1) Because web and python are not my day job, more often than not and with increasing frequency, I ultimately discover that when I disagree with ChatGPT, the AI was right and I was wrong.
2) These specific models often struggle when my response has been "don't use $thing or $approach"; unfortunately this seems to be equally applicable regardless of if the AI knew more than me or not, so it's not got predictive power for me.
I wish people would understand what a large language model is. There is no thinking. No comprehension. No decisions.
Instead, think of your queries as super human friendly SQL.
The database? Massive amounts of data boiled down to unique entries with probabilities. This is a simplistic, but accurate way to think of LLMs.
So how much code is on the web for a particular problem? 10k blog entries, stackoverflow responses? What you get back is a mishmash of these.
So it will have decade old libraries, as lots of those scraped responses are 10 years old, and often without people saying so.
And it will likely have more poor code examples than not.
I'm willing to bet that OpenAI's ingestion of stackoverflow responses stipulated higher priority on accepted answers, but that still leaves a lot of margin.
And how you write your query, may sideline you into responses with low quality output.
I guess my point is, when you use LLMs for tasks, you're getting whatever other humans have said.
And I've seen some pretty poor code examples out there.
I actually find it super refreshing that they write "beginner" or "tutorial code".
Maybe because of experience: it's much simpler and easier to turn that into "senior code". After a few decades of experience I appreciate simplicity over the over-engineering mess that some mid-level developers tend to produce.
I used to really like Claude for code tasks but lately it has been a frustrating experience. I use it for writing UI components because I just don’t enjoy FE even though I have a lot of experience on it from back in the day.
I tell it up front that I am using react-ts and mui.
80% of the time it will use tailwind classes which makes zero sense. It won’t use the sx prop and mui system.
It is also outdated it seems. It keeps using deprecated props and components which sucks and adds more manual effort on my end to fix. I like the quality of Claude’s UX output, it’s just a shame that it seems so bad on actual coding tasks.
I stopped using it for any backend work because it is so outdated, or maybe it just doesn’t have the right training data.
On the other hand, I give ChatGPT a link to the docs and it gives me the right code 90% or more of the time. Only shame is that its UX output is awful compared to Claude. I am also able to trust it for backend tasks, even if it is verbose AF with the explanations (it wants to teach me even if I tell it to return code only).
Either way, using these tools in conjunction saves me at least 30 min to an hour daily on tasks that I dislike.
I can crank out code better than AI, and I actually know and understand systems design and architecture to build a scalable codebase, both technically and at an organizational level. Easy to modify and extend, test, and single responsibility.
AI just slams everything into a single class or uses weird utility functions that make no sense on the regular. Still, it’s a useful tool in the right use cases.
I've stopped using LLMs to write code entirely. Instead, I use Claude and Qwen as "brilliant idiots" for rubber ducking. I never copy and paste code it gives me, I use it to brainstorm and get me unstuck.
To each their own, and everyone's experience seems to vary, but I have a hard time picturing people using Claude/ChatGPT web UIs for any serious development. It seems like so much time would be wasted recreating good context, copy/pasting, etc.
We have tools like Aider (which has a copy/paste mode if you don't have API access for some reason), Cline, CoPilot edit mode, and more. Things like having a conventions file, exposing the dependencies list, and easy addition of files into context seem essential to me in order to make LLMs productive, and I always spend more time steering results when easy, consistent context isn't at my fingertips.
Both these issues can be resolved by adding some sample code to context to influence the LLM to do the desired thing.
As the op says, LLMs are going to be biased towards doing the "average" thing based on their training data. There's more old backend code on the internet than new backend code, and Tailwind is pretty dominant for frontend styling these days, so that's where the average lands.
>Problem is, how would you know if you have never learned to code without an LLM?
The quick fix I use when needing to do something new is to ask the AI to list different libraries and the pros and cons of using them. Then I quickly hop on google and check which have good documentation and examples so I know I have something to fall back on, and from there I ask the AI how to solve a small, simple version of my problem and explain what the library is doing. Only then do I ask it for a solution and see if it is reasonable or not.
It isn't perfect, but it saves enough time most times to more than make up for when it fails and I have to go back to old-fashioned RTFMing.
- asking for fully type annotated python, rather than just python
- specifically ask it for performance optimized code
- specifically ask for code with exception handling
- etc
Things that might lead it away from tutorial style code.
The next hurdle is lack of time sensitivity regarding standards and versions. You can mention the exact framework version in the prompt, but it still comes up with deprecated or obsolete methods. Initially it may be appealing to someone who knows nothing about the framework, but an LLM won't grow anyone to an expert level in rapidly changing tech.
LLMs are trained on content from places like Stack Overflow, reddit, and github code,
and they generate tokens calculated as a sort of aggregate of statistically likely, mediocre code.
Of course the result is going be uninspired and impractical.
Writing good code takes more than copy-pasting the same thing everyone else is doing.
I've just been using them for completion. I start writing, and give it a snippet + "finish refactoring this so that xyz."
That and unit tests. I write the first table based test case, then give it the source and the test code, and ask it to fill it in with more test cases.
I suspect it's not going to be much of a problem. Generated code has been getting rapidly better. We can readjust about what to worry about once that slows or stops, but I suspect unoptimized code will not be of much concern.
Totally agree, seen it too. Do you think it can be fixed over time with better training data and optimization? Or, is this a fundamental limitation that LLMs will never overcome?
> This process removes all PostgreSQL components, cleans up leftover files, and reinstalls a fresh copy. By preserving the data directory (/var/lib/postgresql), we ensure that existing databases are retained. This method provides a clean slate for PostgreSQL while maintaining continuity of stored data.
Is the problem that the antonym is a substring within "without losing the data in the database"? I've seen problems with opposites for LLMs before. If you specify "retaining the data" or "keeping the data" does it get it right?
The problem is that these are fundamentally NOT reasoning systems. Even when contorted into "reasoning" models, these are just stochastic parrots guessing the next words in the hopes that it's the correct reasoning "step" in the context.
No approach is going to meaningfully work here. Fiddling with the prompt may get you better guesses, but they will always be guesses. Even without the antonym it's just a diceroll on whether the model will skip or add a step.
I have just opened your link and it does not contain the exact text you quoted anymore; now it is:
> This process removes all PostgreSQL components except the data directory, ensuring existing databases are retained during the reinstall. It provides a clean slate for PostgreSQL while maintaining continuity of stored data. Always backup important data before performing major system changes.
And as its first source it cites exactly your comment. Strange.
Does that site generate a new page for each user, or something like that? My copy seemed to have more sensible directions (it says to backup the database, remove everything, reinstall, and then restore from the backup). As someone who doesn’t work on databases, I can’t really tell if these are good instructions, and it is throwing some “there ought to be a tool for this/it is unusual to manually rm stuff” flags in the back of my head. But at least it isn’t totally silly…
My guess is that it tried to fuse together an answer to 2 different procedures: A) completely uninstall and B) (re)install without losing data. It doesn't know what you configured as the data directory, or if it is a default Debian installation. Prompt is too vague.
The headline question here alone gets at what is the biggest widespread misunderstanding of LLMs, which causes people to systematically doubt and underestimate their ability to exhibit real creativity and understanding based problem solving.
At its core an LLM is a sort of "situation specific simulation engine." You set up a scenario, and it then plays it out with its own internal model of the situation, trained on predicting text in a huge variety of situations. This includes accurate real world models of, e.g., physical systems and processes, which are not going to be accessed or used by prompts that don't correctly instruct it to do so.
At its core increasingly accurate prediction of text, that is accurately describing a time series of real world phenomena, requires an increasingly accurate and general model of the real world. There is no sense in which there is a simpler way to accurately predict text that represents real world phenomena in cross validation, without actually understanding and modeling the underlying processes generating those outcomes represented in the text.
Much of the training text is real humans talking about things they don't understand deeply, and saying things that are wrong or misleading. The model will fundamentally simulate these type of situations it was trained to simulate reliably, which includes frequently (for lack of a better word) answering things "wrong" or "badly" "on purpose" - even when it actually contains an accurate heuristic model of the underlying process, it will still, faithfully according to the training data, often report something else instead.
This can largely be mitigated with more careful and specific prompting of what exactly you are asking it to simulate. If you don't specify, there will be a high frequency of accurately simulating uninformed idiots, as occur in much of the text on the internet.
> At its core an LLM is a sort of "situation specific simulation engine."
"Sort of" is doing Sisyphean levels of heavy lifting here. LLMs are statistical models trained on vast amounts of symbols to predict the most likely next symbol, given a sequence of previous symbols. LLMs may appear to exhibit "real creativity", "understand" problem solving (or anything else), or serve as "simulation engines", but it's important to understand that they don't currently do any of those things.
I'm not sure if you read the entirety of my comment? Increasingly accurately predicting the next symbol given a sequence of previous symbols, when the symbols represent a time series of real world events, requires increasingly accurately modeling- aka understanding- the real world processes that lead to the events described in them. There is provably no shortcut there- per Solomonoff's theory of inductive inference.
It is a misunderstanding to think of them as fundamentally separate and mutually exclusive, and believing that to be true makes people convince themselves that they cannot possibly ever do things which they can already provably do.
Noam Chomsky (embarrassingly) wrote a NYT article on how LLMs could never, with any amount of improvements be able to answer certain classes of questions - even in principle. This was days before GPT-4 came out, and it could indeed correctly answer the examples he said could not be ever answered- and any imaginable variants thereof.
Receiving symbols and predicting the next one is simply a way of framing input and output that enables training and testing- but doesn't specify or imply any particular method of predicting the symbols, or any particular level of correct modeling or understanding of the underlying process generating the symbols. We are both doing exactly that right now, by talking online.
> This can largely be mitigated with more careful and specific prompting of what exactly you are asking it to simulate. If you don't specify, there will be a high frequency of accurately simulating uninformed idiots, as occur in much of the text on the internet.
I don't think people are underestimating LLMs, they're just acknowledging that by the time you've provided sufficient specification, you're 80% of the way to solving the problem/writing the code already. And at that point, it's easier to just finish the job yourself rather than have to go through the LLM's output, validate the content, revise further if necessary, etc
I'm actually in the camp that they are basically not very useful yet, and don't actually use them myself for real tasks. However, I am certain from direct experimentation that they exhibit real understanding, creativity, and modeling of underlying systems that extrapolates to correctly modeling outcomes in totally novel situations, and don't just parrot snippets of text from the training set.
What people want and expect them to be is an Oracle that correctly answers their vaguely specified questions, which is simply not what they are, or are good at. What they can do is fascinating and revolutionary, but possibly not very useful yet, at least until we think of a way to use it, or make it even more intelligent. In fact, thinking is what they are good at, and simply repeating facts from a training set is something they cannot do reliably- because the model must inherently be too compressed to store a lot of facts correctly.
> systematically doubt and underestimate their ability to exhibit real creativity and understanding based problem solving.
I fundamentally disagree that anything in the rest of your post actually demonstrates that they have any such capacity at all.
It seems to me that this is because you consider the terms "creativity" and "problem solving" to mean something different. With my understanding of those terms, it's fundamentally impossible for an LLM to exhibit those qualities, because they depend on having volition - an innate spontaneous generation of ideas for things to do, and an innate desire to do them. An LLM only ever produces output in response to a prompt - not because it wants to produce output. It doesn't want anything.
> it's fundamentally impossible for an LLM to exhibit those qualities, because they depend on having volition
I don't see the connection between volition and those other qualities, saying one depends on the other seems arbitrary to me- and would result in semantically and categorically defining away the possibility of non-human intelligence altogether, even from things that are in all accounts capable of much more than humans in almost every aspect.
People don't even universally agree that humans have volition- it is an age old philosophical debate.
Perhaps you can tell me your thoughts or definition of what those things (as well as volition itself) mean? I will share mine here.
Creativity is the ability to come up with something totally new that is relevant to a specific task or problem - e.g. a new solution to a problem, a new artwork that expresses an emotion, etc. In both humans and LLMs these creative ideas don't seem to be totally 'de novo' but seem to come mostly from drawing high level analogies between similar but different things, and copying ideas and aspects from one to another. Fundamentally, it does require a task or goal, but that itself doesn't have to be internal. If an LLM is prompted, or if I am given a task by my employer, we are still both exhibiting creativity when we solve it in a new way.
Problem solving is I think similar but more practical- when prompted with a problem that isn't exactly in the training set, can it come up with a workable solution or correct answer? Presumably by extrapolating, or using some type of generalized model that can extrapolate or interpolate to situations not exactly in the training data. Sure there must be a problem here that is trying to be solved, but it seems irrelevant if that is due to some internal will or goals, or an external prompt.
In the sense that volition is selecting between different courses of action towards a goal- LLMs do select between different possible outputs based on probabilities about how suitable they are in context of the given goal of response to a prompt.
Good perspective. Maybe it's because people are primed by sci-fi to treat this as a god-like oracle model. Note that even in the real-world simulations can give wrong results as we don't have perfect information, so we'll probably never have such an oracle model.
But if you stick with the oracle framework, then it'd be better to model it as some sort of "fuzzy oracle" machine, right? I'm vaguely reminded of probabilistic Turing machines here, in that you have some intrinsic amount of error (both due to the stochastic sampling as well as imperfect information). But the fact that prompting and RLHF works so well implies that by crawling around in this latent space, we can bound the errors to the point that it's "almost" an oracle, or a "simulation" of the true oracle that people want it to be.
And since lazy prompting techniques still work, that seems to imply that there's juice left to squeeze in terms of "alignment" (not in the safety sense, but in conditioning the distribution of outputs to increase the fidelity of the oracle simulation).
Also the second consequence is that probably the reason it needs so much data is because it just doesn't model _one_ thing, it tries to be a joint model of _everything_. A human learns with far less data, but the result is only a single personality. For a human to "act" as someone, they need to do training, character studies, and such to try to "learn" about the person, and even then good acting is a rare skill.
If you genuinely want an oracle machine, there's no way to avoid vacuuming up all the data that exists, because without it you can't make a high fidelity simulation of someone else. But on the flipside, if you're willing to be smarter about what facets you exclude then I'd guess there's probably a way to prune models in a way smarter than just quantizing them. I guess this is close to mixture-of-experts.
I get that people really want an oracle, and are going to judge any AI system by how good it does at that - yes from sci-fi influenced expectations that expected AI to be rationally designed, and not inscrutable and alien like LLMs... but I think that will almost always be trying to fit a round peg into a square hole, and not using whatever we come up with very effectively. Surely, as LLMs have gotten better they have become more useful in that way so it is likely to continue getting better at pretending to be an oracle, even if never being very good at that compared to other things it can do.
Arguably, a (the?) key measure of intelligence is being able to accurately understand and model new phenomena from a small amount of data, e.g. in a Bayesian sense. But in this case we are attempting to essentially evolve all of the structures of an intelligent system de novo from a stochastic optimization process - so it is probably better compared to the entire history of evolution than to an individual human learning during their lifetime, although both analogies have big problems.
Overall, I think the training process will ultimately only be required to build a generally intelligent structure, and good inference from a small set of data or a totally new category of problem/phenomenon will happen entirely at the inference stage.
Just want to note that this simple “mimicry” of mistakes seen in the training text can be mitigated to some degree by reinforcement learning (e.g. RLHF), such that the LLM is tuned toward giving responses that are “good” (helpful, honest, harmless, etc…) according to some reward function.
> At its core an LLM is a sort of "situation specific simulation engine." You set up a scenario, and it then plays it out with its own internal model of the situation, trained on predicting text in a huge variety of situations. This includes accurate real world models of, e.g., physical systems and processes, which are not going to be accessed or used by prompts that don't correctly instruct it to do so.
This idea of LLMs doing simulations of the physical world I've never heard before. In fact a transformer model cannot do this. Do you have a source?
I have been using various LLMs to do some meal planning and recipe creation. I asked for summaries of the recipes and they looked good.
I then asked it to link a YouTube video for each recipe and it used the same video 10 times for all of the recipes. No amount of prompting was able to fix it unless I request one video at a time. It would just acknowledge the mistake, apologize and then repeat the same mistake again.
I told it let’s try something different and generate a shopping list of ingredients to cover all of the recipes, it recommended purchasing amounts that didn’t make sense and even added some random items that did not occur in any of the recipes
When I was making the dishes, I asked for the detailed recipes and it completely changed them, adding ingredients that were not on the shopping list. When I pointed it out it again, it acknowledged the mistake, apologized, and then “corrected it” by completely changing it again.
I would not conclude that I am a lazy or bad prompter, and I would not conclude that the LLMs exhibited any kind of remarkable reasoning ability. I even interrogated the AIs about why they were making the mistakes and they told me because “it just predicts the next word”.
Another example is, I asked the bots for tips on how to feel my pecs more on incline cable flies, it told me to start with the cables above shoulder height, which is not an incline fly, it is a decline fly. When I questioned it, it told me to start just below shoulder height, which again is not an incline fly.
My experience is that you have to write a draft of the note you were trying to create or leave so many details in the prompts that you are basically doing most of the work yourself. It’s great for things like give me a recipe that contains the following ingredients or clean up the following note to sound more professional. Anything more than that it tends to fail horribly for me. I have even had long conversations with the AIs asking them for tips on how to generate better prompts and it’s recommending things I’m already doing.
When people remark about the incredible reasoning ability, I wonder if they are just testing it on things that were already in the training data or they are failing to recognize how garbage the output can be. However, perhaps we can agree that the reasoning ability is incredible in the sense that it can do a lot of reasoning very quickly, but it completely lacks any kind of common sense and often does the wrong kind of reasoning.
For example, the prompt about tips to feel my pecs more on an incline cable fly could have just entailed “copy and pasting” a pre-written article from the training data; but instead in its own words, it “over analyzed bench angles and cable heights instead of addressing what you meant”. One of the bots did “copy paste” a generic article that included tips for decline flat and incline. None correctly gave tips for just incline on the first try, and some took several rounds of iteration basically spoon feeding the model the answer before it understood.
You're expecting it to be an 'oracle' that you prompt it with any question you can think of, and it answers correctly. I think your experiences will make more sense in the context of thinking of it as a heuristic model based situation simulation engine, as I described above.
For example, why would it have URLs to youtube videos of recipes? There is not enough storage in the model for that. The best it can realistically do is provide a properly formatted youtube URL. It would be nice if it could instead explain that it has no way to know that, but that answer isn't appropriate within the context of the training data and prompt you are giving it.
The other things you asked also require information it has no room to store, and would be impossibly difficult to essentially predict via model from underlying principles. That is something they can do in general- even much better than humans already in many cases- but is still a very error prone process akin to predicting the future.
For example, I am a competitive strength athlete, and I have a doctorate level training in human physiology and biomechanics. I could not reason out a method for you to feel your pecs better without seeing what you are already doing and coaching you in person, and experimenting with different ideas and techniques myself- also having access to my own actual human body to try movements and psychological cues on.
You are asking it to answer things that are nearly impossible to compute from first principles without unimaginable amounts of intelligence and compute power, and are unlikely to have been directly encoded in the model itself.
Now turning an already written set of recipes into a shopping list is something I would expect it to be able to do easily and correctly if you were using a modern model with a sufficiently sized context window, and prompting it correctly. I just did a quick test where I gave GPT 4o only the instruction steps (not the ingredients list) for an oxtail soup recipe, and it accurately recreated the entire shopping list, organized realistically according to likely sections in the grocery store. What model were you using?
> At its core an LLM is a sort of "situation specific simulation engine." You set up a scenario, and it then plays it out with its own internal model of the situation, trained on predicting text in a huge variety of situations. This includes accurate real world models of, e.g., physical systems and processes, which are not going to be accessed or used by prompts that don't correctly instruct it to do so.
You have simply invented total nonsense about what an LLM is "at it's core". Confidently stating this does not make it true.
Except I didn't just state it, I also explained the rationale behind it, and elaborated further on that substantially in subsequent replies to other comments. What is your specific objection?
By iterating it 5 times the author is using ~5x the compute. It's kind of a strange chain of thought.
Also: premature optimization is evil. I like the first iteration most. It's not "beginner code", it's simple. Tell Sonnet to optimize it IF benchmarks show it's a perf problem. But a codebase full of code like this, even when unnecessary, would be a nightmare.
This is not what premature optimization is the root of all evil means. It’s a tautological indictment of doing unnecessary things. It’s not in support of making obviously naive algorithms. And if it were it wouldn’t be a statement worth focusing on.
As the point of the article is to see if Claude can write better code from further prompting, it is completely appropriate to “optimize” a single implementation.
I have to disagree. Naive algorithms are absolutely fine if they aren’t performance issues.
The comment you are replying to is making the point that “better” is context dependent. Simple is often better.
> There is no doubt that the grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. - Donald Knuth
I had the same thought when reading the article too. I assumed (and hoped) it was for the sake of the article because there’s a stark difference between idiomatic code and performance focused code.
Living and working in a large code base that only focuses on “performance code” by default sounds very frustrating and time consuming.
So in this article "better" means "faster". This demonstrates that "better" is an ambiguous measure and LLMs will definitely trip up on that.
Also, the article starts out talking about images and the "make it more X" prompt
and says how the results are all "very samey and uninteresting" and converge on the same vague cosmic-y visuals.
What does the author expect will happen to code given the "make it more X" treatment?
I'm glad I'm not the only one who felt that way. The first option is the one you should put into production, unless you have evidence that performance is going to be an issue. By that measure, the first response was the "best."
> I like the first iteration most. It’s not “beginner code”, it’s simple.
Yes, thank you. And honestly, I work with a wide range of experience levels, the first solution is what I expect from the most experienced: it readably and precisely solves the stated problem with a minimum of fuss.
I find that it is IMPORTANT to never start these coding sessions with "write X code". Instead, begin with an "open plan" - something the author does allude to (he calls it prompt engineering; I find it also works as the start of the interaction).
Half the time, the LLM will make massive assumptions about your code and problem (e.g., about data types, about the behaviors of imported functions, about unnecessary optimizations, necessary optimization, etc.). Instead, prime it to be upfront about those assumptions. More importantly, spend time correcting the plan and closing gaps before any code is written.
> I find that it is IMPORTANT to never start these coding sessions with "write X code". Instead, begin with an "open plan"
Most LLMs that I use nowadays make a plan first on their own by default, without needing to be specially prompted. This was definitely not the case a year or so ago. I assume new LLMs have been trained accordingly in the meantime.
True. And that is a step forward.
I notice that they make the plan, and THEN write the code in the same forward pass/generation sequence. The challenge here is that all of the incorrect assumptions get "lumped" into this pass and can pollute the rest of the interaction.
The initial interaction also sets the "scene" for other things, like letting the LLM know that there might be other dependencies and it should not assume behavior (common for most realistic software tasks).
An example prompt I have used (not by any means perfect) ...
> I need help refactoring some code.
Please pay full attention.
Think deeply and confirm with me before you make any changes.
We might be working with code/libs where the API has changed so be mindful of that.
If there is any file you need to inspect to get a better sense, let me know.
As a rule, do not write code. Plan, reason and confirm first.
---
I refactored my db manager class, how should I refactor my tests to fit the changes?
As far as I can see, all the proposed solutions calculate the sums by doing division, and badly. This is in LiveCode, which I'm more familiar with than Python, but it's roughly twice as fast as the mod/div equivalent in LiveCode:
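-- build R so that R[n] holds the digit sum of n, for n = 0 to 99999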
repeat with i = 0 to 9
put i * 10000 into ip
repeat with j = 0 to 9
put j * 1000 into jp
repeat with k = 0 to 9
put k * 100 into kp
repeat with l = 0 to 9
put l * 10 into lp
repeat with m = 0 to 9
put i + j + k + l + m into R[ip + jp + kp + lp + m]
end repeat
end repeat
end repeat
end repeat
end repeat
I had a similar idea, iterating over the previously calculated sums. I implemented it in C# and it's a bit quicker, taking about 78% of the time of yours.
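// sums[n] ends up holding the digit sum of n, built up one decimal place ("level") at a time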
int[] sums = new int[100000];
for (int i = 9; i >= 0; --i)
{
sums[i] = i;
}
int level = 10;
while (level < 100000)
{
for (int p = level - 1; p >= 0; --p)
{
int sum = sums[p];
for (int i = 9; i > 0; --i)
{
sums[level * i + p] = i + sum;
}
}
level *= 10;
}
Yep, I had a vague notion that I was doing too much work, but I was headed out the door so I wrote the naive/better than the original solution, benchmarked it quickly, and posted it before leaving. Yours also has the advantage of being scalable to ranges other than 1-100,000 without having to write more loop code.
HyperTalk was the first programming language I taught myself as opposed to having an instructor; thanks for the nostalgia. Unfortunately it seems the LiveCode project has been idle for a few years now.
LiveCode is still a thing! They just released version 10 a bit ago. If you need to build standard-ish interface apps -- text, images, sliders, radio buttons, checkboxes, menus, etc. -- nothing (I've seen) compares for speed-of-delivery.
I use LC nearly every day, but I drool over Python's math libraries and syntax amenities.
Basically you just have to put it in the mode that's looking for such things
I might have to try some more aggressive prompting :).
Apparently, the singularity ship has sailed, but we really don't want AI to remember us as the species that cursed abuse at it when it was a puppy.
This didn't work. At least not on my task. What model were you using?
I had the impression that it got a little better. After every file converted, it said something along the lines of “Great! We saved another kitten!" It was hilarious.
I think having the mediocre first pass in the context is probably essential to it creating the improved version. I don't think you can really skip the iteration process and get a good result.
Each time, the GPT made trivial mistakes that clearly didn't fit the criteria I asked it to do. Each time I pointed it out and corrected it, it did a bit more of what I wanted it to do.
Point is, it knew what had to be done the entire time and just refused to do it that way for whatever reason.
I used it in another project to solve some trigonometry problems for me and it did great, but for OpenSCAD, damn it was awful.
The LLM kept going in circles between two incorrect solutions, then just repeating the same broken solution while describing it as different. I ended up manually writing the code, which was a nice brain-stretch given that I'm an absolute noob at OpenSCAD.
If you can get it to be wordy about "why" a specific part of the answer was given, it often reveals what its stumbling on, then modify your prompt accordingly.
Deleted Comment
Usually, specifying the packages to use and asking for something less convoluted works really well. Problem is, how would you know if you have never learned to code without an LLM?
In the absence of any other context, that's probably a sensible default behaviour. If someone is just asking "write me some code that does x", they're highly likely to be a beginner and they aren't going to be able to understand or reason about a more sophisticated approach. IME LLMs will very readily move away from that default if you provide even the smallest amount of context; in the case of this article, even by doing literally the dumbest thing that could plausibly work.
I don't mean to cast aspersions, but a lot of criticisms of LLMs are really criticising them for not being psychic. LLMs can only respond to the prompt they're given. If you want highly optimised code but didn't ask for it, how is the LLM supposed to know that's what you wanted?
I use Copilot for autocomplete regularly, and that's still the peak LLM UX for me. I prompt it by just writing code, it automatically pulls into context the file I'm working on and imported files, it doesn't insist on writing an essay explaining itself, and it doesn't get overly ambitious. And in addition to being so much easier to work with, I find it still produces better code than anything I get out of the chat models.
ChatGPT has, for several generations, generally made stuff that works, but the libraries it gives me are often not the most appropriate, and are sometimes obsolete or no longer functional — and precisely because web and python are hobbies for me rather than my day job, it can take me a while to spot such mistakes.
Two other things I've noticed, related in an unfortunate way:
1) Because web and python not my day job, more often than not and with increasing frequency, I ultimately discover that when I disagree with ChatGPT, the AI was right and I was wrong.
2) These specific models often struggle when my response has been "don't use $thing or $approach"; unfortunately this seems to apply regardless of whether the AI knew more than me or not, so it has no predictive power for me.
(I also use custom instructions, so YMMV.)
Instead, think of your queries as super human-friendly SQL.
The database? Massive amounts of data boiled down to unique entries with probabilities. This is a simplistic, but accurate way to think of LLMs.
So how much code is on the web for a particular problem? 10k blog entries, stackoverflow responses? What you get back is a mishmash of these.
So it will have decade old libraries, as lots of those scraped responses are 10 years old, and often without people saying so.
And it will likely have more poor code examples than not.
I'm willing to bet that OpenAI's ingestion of stackoverflow responses gave higher priority to accepted answers, but that still leaves a lot of margin.
And how you write your query may sideline you into low-quality responses.
I guess my point is, when you use LLMs for tasks, you're getting whatever other humans have said.
And I've seen some pretty poor code examples out there.
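To make that framing concrete, here is a toy sketch in Python. It is purely illustrative and nothing like a real transformer: a tiny "database" of next-word counts, queried by weighted sampling. The corpus string is made up.

    import random
    from collections import Counter, defaultdict

    # Toy "database": which word follows which, with what frequency, in a
    # tiny made-up corpus. Real LLMs learn contextual representations, but
    # the "weighted mishmash of what the training text said" intuition holds.
    corpus = "use requests to fetch the url . use urllib2 to fetch the url .".split()

    follows = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        follows[prev][nxt] += 1

    def next_word(prev):
        words, counts = zip(*follows[prev].items())
        return random.choices(words, weights=counts)[0]

    print(next_word("use"))  # "requests" or the decade-old "urllib2", by frequency

If the training text skews old, the samples skew old; how you phrase the query just shifts which part of that weighted mishmash you land in.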
Maybe because of experience: it's much simpler and easier to turn that into "senior code". After a few decades of experience I appreciate simplicity over the over-engineering mess that some mid-level developers tend to produce.
I tell it up front that I am using react-ts and mui.
80% of the time it will use tailwind classes which makes zero sense. It won’t use the sx prop and mui system.
It is also outdated it seems. It keeps using deprecated props and components which sucks and adds more manual effort on my end to fix. I like the quality of Claude’s UX output, it’s just a shame that it seems so bad on actual coding tasks.
I stopped using it for any backend work because it is so outdated, or maybe it just doesn’t have the right training data.
On the other hand, I give ChatGPT a link to the docs and it gives me the right code 90% or more of the time. Only shame is that its UX output is awful compared to Claude. I am also able to trust it for backend tasks, even if it is verbose AF with the explanations (it wants to teach me even if I tell it to return code only).
Either way, using these tools in conjunction saves me at least 30 min to an hour daily on tasks that I dislike.
I can crank out code better than AI, and I actually know and understand systems design and architecture well enough to build a codebase that scales both technically and organizationally: easy to modify, extend, and test, with single responsibilities.
AI just slams everything into a single class or uses weird utility functions that make no sense on the regular. Still, it’s a useful tool in the right use cases.
Just my 2 cents.
I'm more comfortable using it this way.
We have tools like Aider (which has a copy/paste mode if you don't have API access for some reason), Cline, Copilot edit mode, and more. Things like having a conventions file, exposing the dependencies list, and easy addition of files into context seem essential to me in order to make LLMs productive, and I always spend more time steering results when easy, consistent context isn't at my fingertips.
As the op says, LLMs are going to be biased towards doing the "average" thing based on their training data. There's more old backend code on the internet than new backend code, and Tailwind is pretty dominant for frontend styling these days, so that's where the average lands.
The quick fix I use when needing to do something new is to ask the AI to list different libraries for me and the pros and cons of using them. Then I quickly hop on Google and check which have good documentation and examples so I know I have something to fall back on, and from there I ask the AI how to solve a small, simple version of my problem and explain what the library is doing. Only then do I ask it for a solution and see if it is reasonable or not.
It isn't perfect, but it saves enough time most times to more than make up for when it fails and I have to go back to old-fashioned RTFMing.
That and unit tests. I write the first table-based test case, then give it the source and the test code, and ask it to fill it in with more test cases.
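For anyone unfamiliar with the pattern, that first table-based case can be as small as this pytest sketch (the module and function under test are made up); the LLM then only has to append rows to the table.

    import pytest

    from mypkg.digits import digit_sum  # hypothetical function under test

    # Table-driven test: adding coverage just means appending (input, expected) rows.
    CASES = [
        (0, 0),
        (7, 7),
        (3999, 30),
        (99930, 30),
    ]

    @pytest.mark.parametrize("n,expected", CASES)
    def test_digit_sum(n, expected):
        assert digit_sum(n) == expected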
https://www.phind.com/search?cache=lrcs0vmo0wte5x6igp5i3607
It still seems to struggle with basic instructions, and even with understanding what it itself is doing.
> This process removes all PostgreSQL components, cleans up leftover files, and reinstalls a fresh copy. By preserving the data directory (/var/lib/postgresql), we ensure that existing databases are retained. This method provides a clean slate for PostgreSQL while maintaining continuity of stored data.
Did we now?
https://beta.gitsense.com/?chats=a5d6523c-0ab8-41a8-b874-b31...
The left side contains the Phind response that I got and the right side contains a review of the response.
Claude 3.5 Sonnet, GPT-4o and GPT-4o mini were not too happy with the response and called out the contradiction.
Edit: Chat has been disabled as I don't want to incur an unwanted bill
The problem is that these are fundamentally NOT reasoning systems. Even when contorted into "reasoning" models, these are just stochastic parrots guessing the next words in the hopes that it's the correct reasoning "step" in the context.
No approach is going to meaningfully work here. Fiddling with the prompt may get you better guesses, but they will always be guesses. Even without the antonym it's just a diceroll on whether the model will skip or add a step.
> This process removes all PostgreSQL components except the data directory, ensuring existing databases are retained during the reinstall. It provides a clean slate for PostgreSQL while maintaining continuity of stored data. Always backup important data before performing major system changes.
And, strangely enough, as its first source it cites exactly your comment:
> https://news.ycombinator.com/item?id=42586189
At its core an LLM is a sort of "situation-specific simulation engine." You set up a scenario, and it then plays it out with its own internal model of the situation, trained on predicting text in a huge variety of situations. This includes accurate real-world models of, e.g., physical systems and processes, which are not going to be accessed or used by prompts that don't correctly instruct it to do so.
At its core, increasingly accurate prediction of text that describes a time series of real-world phenomena requires an increasingly accurate and general model of the real world. There is no simpler way to accurately predict, under cross-validation, text that represents real-world phenomena without actually understanding and modeling the underlying processes generating the outcomes represented in that text.
Much of the training text is real humans talking about things they don't understand deeply, and saying things that are wrong or misleading. The model will fundamentally simulate the types of situations it was trained to simulate, and do so reliably, which includes frequently (for lack of a better word) answering things "wrong" or "badly" "on purpose": even when it actually contains an accurate heuristic model of the underlying process, it will still, faithfully to the training data, often report something else instead.
This can largely be mitigated with more careful and specific prompting of what exactly you are asking it to simulate. If you don't specify, there will be a high frequency of accurately simulating uninformed idiots, as occur in much of the text on the internet.
"Sort of" is doing Sisisyphian levels of heavy lifting here. LLMs are statistical models trained on vast amounts of symbols to predict the most likely next symbol, given a sequence of previous symbols. LLMs may appear to exhibit "real creativity", "understand" problem solving (or anything else), or serve as "simulation engines", but it's important to understand that they don't currently do any of those things.
It is a misunderstanding to think of them as fundamentally separate and mutually exclusive, and believing that to be true makes people convince themselves that they cannot possibly ever do things which they can already provably do.
Noam Chomsky (embarrassingly) wrote a NYT article on how LLMs could never, with any amount of improvement, answer certain classes of questions, even in principle. This was days before GPT-4 came out, and it could indeed correctly answer the examples he said could never be answered, and any imaginable variants thereof.
Receiving symbols and predicting the next one is simply a way of framing input and output that enables training and testing; it doesn't specify or imply any particular method of predicting the symbols, or any particular level of correct modeling or understanding of the underlying process generating the symbols. We are both doing exactly that right now, by talking online.
I don't think people are underestimating LLMs, they're just acknowledging that by the time you've provided sufficient specification, you're 80% of the way to solving the problem/writing the code already. And at that point, it's easier to just finish the job yourself rather than have to go through the LLM's output, validate the content, revise further if necessary, etc
What people want and expect them to be is an Oracle that correctly answers their vaguely specified questions, which is simply not what they are, or are good at. What they can do is fascinating and revolutionary, but possibly not very useful yet, at least until we think of a way to use it, or make it even more intelligent. In fact, thinking is what they are good at, and simply repeating facts from a training set is something they cannot do reliably, because the model must inherently be too compressed to store a lot of facts correctly.
I fundamentally disagree that anything in the rest of your post actually demonstrates that they have any such capacity at all.
It seems to me that this is because you consider the terms "creativity" and "problem solving" to mean something different. With my understanding of those terms, it's fundamentally impossible for an LLM to exhibit those qualities, because they depend on having volition - an innate spontaneous generation of ideas for things to do, and an innate desire to do them. An LLM only ever produces output in response to a prompt - not because it wants to produce output. It doesn't want anything.
I don't see the connection between volition and those other qualities; saying one depends on the other seems arbitrary to me, and would result in semantically and categorically defining away the possibility of non-human intelligence altogether, even for things that are by all accounts capable of much more than humans in almost every respect. People don't even universally agree that humans have volition; it is an age-old philosophical debate.
Perhaps you can tell me your thoughts or definition of what those things (as well as volition itself) mean? I will share mine here.
Creativity is the ability to come up with something totally new that is relevant to a specific task or problem, e.g. a new solution to a problem, a new artwork that expresses an emotion, etc. In both humans and LLMs these creative ideas don't seem to be totally 'de novo' but seem to come mostly from drawing high-level analogies between similar but different things, and copying ideas and aspects from one to another. Fundamentally, it does require a task or goal, but that itself doesn't have to be internal. If an LLM is prompted, or if I am given a task by my employer, we are still both exhibiting creativity when we solve it in a new way.
Problem solving is, I think, similar but more practical: when prompted with a problem that isn't exactly in the training set, can it come up with a workable solution or correct answer? Presumably by extrapolating, or by using some type of generalized model that can extrapolate or interpolate to situations not exactly in the training data. Sure, there must be a problem that is being solved, but it seems irrelevant whether that comes from some internal will or goal, or from an external prompt.
In the sense that volition is selecting between different courses of action towards a goal, LLMs do select between different possible outputs based on probabilities about how suitable they are in the context of the given goal of responding to a prompt.
But if you stick with the oracle framework, then it'd be better to model it as some sort of "fuzzy oracle" machine, right? I'm vaguely reminded of probabilistic turing machines here, in that you have some intrinsic amount of error (both due to the stochastic sampling as well as imperfect information). But the fact that prompting and RLHF works so well implies that by crawling around in this latent space, we can bound the errors to the point that it's "almost" an oracle, or a "simulation" of the true oracle that people want it to be.
And since lazy prompting techniques still work, that seems to imply that there's juice left to squeeze in terms of "alignment" (not in the safety sense, but in conditioning the distribution of outputs to increase the fidelity of the oracle simulation).
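A toy version of that error-bounding intuition, assuming independent samples (which LLM outputs are not, so treat it as intuition only): if each sample is right with probability p > 0.5, majority-voting k samples shrinks the error quickly, the standard amplification argument for probabilistic machines.

    import random

    def majority_error(p: float, k: int, trials: int = 100_000) -> float:
        """Estimate how often a k-sample majority vote is wrong when each
        independent sample is correct with probability p."""
        wrong = 0
        for _ in range(trials):
            correct = sum(random.random() < p for _ in range(k))
            if correct <= k // 2:  # no majority of correct samples
                wrong += 1
        return wrong / trials

    for k in (1, 5, 21):
        print(k, majority_error(0.7, k))  # error shrinks quickly as k grows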
Also the second consequence is that probably the reason it needs so much data is because it just doesn't model _one_ thing, it tries to be a joint model of _everything_. A human learns with far less data, but the result is only a single personality. For a human to "act" as someone, they need to do training, character studies, and such to try to "learn" about the person, and even then good acting is a rare skill.
If you genuinely want an oracle machine, there's no way to avoid vacuuming up all the data that exists, because without it you can't make a high-fidelity simulation of someone else. But on the flip side, if you're willing to be smarter about which facets you exclude, then I'd guess there's probably a way to prune models that is smarter than just quantizing them. I guess this is close to mixture-of-experts.
Arguably, a (the?) key measure of intelligence is being able to accurately understand and model new phenomena from a small amount of data, e.g. in a Bayesian sense. But in this case we are attempting to essentially evolve all of the structures of an intelligent system de novo from a stochastic optimization process, so it is probably better compared to the entire history of evolution than to an individual human learning during their lifetime, although both analogies have big problems.
Overall, I think the training process will ultimately only be required to build a generally intelligent structure, and good inference from a small set of data or a totally new category of problem/phenomenon will happen entirely at the inference stage.
I've never heard this idea of LLMs doing simulations of the physical world before. In fact, a transformer model cannot do this. Do you have a source?
I then asked it to link a YouTube video for each recipe and it used the same video 10 times for all of the recipes. No amount of prompting was able to fix it unless I request one video at a time. It would just acknowledge the mistake, apologize and then repeat the same mistake again.
I told it let’s try something different and generate a shopping list of ingredients to cover all of the recipes, it recommended purchasing amounts that didn’t make sense and even added some random items that did not occur in any of the recipes
When I was making the dishes, I asked for the detailed recipes and it completely changed them, adding ingredients that were not on the shopping list. When I pointed it out it again, it acknowledged the mistake, apologized, and then “corrected it” by completely changing it again.
I would not conclude that I am a lazy or bad prompter, and I would not conclude that the LLMs exhibited any kind of remarkable reasoning ability. I even interrogated the AIs about why they were making the mistakes and they told me because “it just predicts the next word”.
Another example is, I asked the bots for tips on how to feel my pecs more on incline cable flies, it told me to start with the cables above shoulder height, which is not an incline fly, it is a decline fly. When I questioned it, it told me to start just below shoulder height, which again is not an incline fly.
My experience is that you have to write a draft of the note you were trying to create or leave so many details in the prompts that you are basically doing most of the work yourself. It’s great for things like give me a recipe that contains the following ingredients or clean up the following note to sound more professional. Anything more than that it tends to fail horribly for me. I have even had long conversations with the AIs asking them for tips on how to generate better prompts and it’s recommending things I’m already doing.
When people remark about the incredible reasoning ability, I wonder if they are just testing it on things that were already in the training data or they are failing to recognize how garbage the output can be. However, perhaps we can agree that the reasoning ability is incredible in the sense that it can do a lot of reasoning very quickly, but it completely lacks any kind of common sense and often does the wrong kind of reasoning.
For example, the prompt about tips to feel my pecs more on an incline cable fly could have just entailed “copy and pasting” a pre-written article from the training data; but instead in its own words, it “over analyzed bench angles and cable heights instead of addressing what you meant”. One of the bots did “copy paste” a generic article that included tips for decline flat and incline. None correctly gave tips for just incline on the first try, and some took several rounds of iteration basically spoon feeding the model the answer before it understood.
For example, why would it have URLs to youtube videos of recipes? There is not enough storage in the model for that. The best it can realistically do is provide a properly formatted youtube URL. It would be nice if it could instead explain that it has no way to know that, but that answer isn't appropriate within the context of the training data and prompt you are giving it.
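If you do want real links out of such a workflow, the practical move is to verify them outside the model. A rough sketch of that idea (the checking happens in your code, not in the model; YouTube's public oEmbed endpoint returns an HTTP error for videos that don't exist):

    import urllib.error
    import urllib.parse
    import urllib.request

    def youtube_video_exists(url: str) -> bool:
        """Probe YouTube's public oEmbed endpoint; it returns an HTTP error
        for malformed or nonexistent video URLs."""
        probe = ("https://www.youtube.com/oembed?format=json&url="
                 + urllib.parse.quote(url, safe=""))
        try:
            with urllib.request.urlopen(probe, timeout=10):
                return True
        except urllib.error.HTTPError:
            return False

    # e.g. drop any LLM-suggested recipe video that doesn't actually resolve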
The other things you asked also require information it has no room to store, and would be impossibly difficult to essentially predict via model from underlying principles. That is something they can do in general- even much better than humans already in many cases- but is still a very error prone process akin to predicting the future.
For example, I am a competitive strength athlete, and I have a doctorate level training in human physiology and biomechanics. I could not reason out a method for you to feel your pecs better without seeing what you are already doing and coaching you in person, and experimenting with different ideas and techniques myself- also having access to my own actual human body to try movements and psychological cues on.
You are asking it to answer things that are nearly impossible to compute from first principles without unimaginable amounts of intelligence and compute power, and are unlikely to have been directly encoded in the model itself.
Now turning an already written set of recipes into a shopping list is something I would expect it to be able to do easily and correctly if you were using a modern model with a sufficiently sized context window and prompting it correctly. I just did a quick test where I gave GPT-4o only the instruction steps (not the ingredients list) for an oxtail soup recipe, and it accurately recreated the entire shopping list, organized realistically according to likely sections of the grocery store. What model were you using?
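For anyone who wants to try reproducing that quick test, it's only a few lines with the OpenAI Python SDK; the recipe steps and prompt wording below are placeholders, not exactly what I ran.

    from openai import OpenAI  # pip install openai

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    # Placeholder instruction steps, not the actual recipe used.
    instruction_steps = """
    1. Brown the oxtail in batches, then set it aside.
    2. Soften onion, carrot, and celery; add garlic and tomato paste.
    3. Return the oxtail, cover with beef stock, add bay leaves; simmer 3 hours.
    """

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": "From these instruction steps only, reconstruct a complete "
                       "shopping list grouped by grocery-store section:\n"
                       + instruction_steps,
        }],
    )
    print(resp.choices[0].message.content)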
You have simply invented total nonsense about what an LLM is "at its core". Confidently stating this does not make it true.
Also: premature optimization is evil. I like the first iteration most. It's not "beginner code", it's simple. Tell Sonnet to optimize it IF benchmarks show it's a perf problem. But a codebase full of code like this, even when unnecessary, would be a nightmare.
The point of the article is to see if Claude can write better code from further prompting, so it is completely appropriate to "optimize" a single implementation.
The comment you are replying to is making the point that “better” is context dependent. Simple is often better.
> There is no doubt that the grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. - Donald Knuth
Living and working in a large code base that only focuses on “performance code” by default sounds very frustrating and time consuming.
Also, the article starts out talking about images and the "make it more X" prompt and says how the results are all "very samey and uninteresting" and converge on the same vague cosmic-y visuals. What does the author expect will happen to code given the "make it more X" treatment?
Yes, thank you. And honestly, I work with a wide range of experience levels, the first solution is what I expect from the most experienced: it readably and precisely solves the stated problem with a minimum of fuss.
Half the time, the LLM will make massive assumptions about your code and problem (e.g., about data types, about the behaviors of imported functions, about unnecessary optimizations, necessary optimization, etc.). Instead, prime it to be upfront about those assumptions. More importantly, spend time correcting the plan and closing gaps before any code is written.
https://newsletter.victordibia.com/p/developers-stop-asking-...
- Don't start by asking LLMs to write code directly, instead analyze and provide context
- Provide complete context upfront and verify what the LLM needs
- Ask probing questions and challenge assumptions
- Watch for subtle mistakes (outdated APIs, mixed syntax)
- Checkpoint progress to avoid context pollution
- Understand every line to maintain knowledge parity
- Invest in upfront design
Most LLMs that I use nowadays make a plan first on their own by default, without needing to be specially prompted. This was definitely not the case a year ago or so. I assume newer LLMs have been trained accordingly in the meantime.
The initial interaction also sets the "scene" for other things, like letting the LLM know that there might be other dependencies and it should not assume behavior (common for most realistic software tasks).
An example prompt I have used (not by any means perfect) ...
> I need help refactoring some code. Please pay full attention. Think deeply and confirm with me before you make any changes. We might be working with code/libs where the API has changed so be mindful of that. If there is any file you need to inspect to get a better sense, let me know. As a rule, do not write code. Plan, reason and confirm first.
--- I refactored my db manager class, how should I refactor my tests to fit the changes?
I use LC nearly every day, but I drool over Python's math libraries and syntax amenities.