simianwords · 6 months ago
My interpretation of the progress.

3.5 to 4 was the biggest leap. It went from being a party trick to legitimately useful at times. It did hallucinate a lot, but I was still able to get some use out of it. I wouldn't count on it for most things, however. It could answer simple questions and mostly get them right, but never one or two levels deep.

I clearly remember 4o was also a decent leap - the accuracy increased substantially. It could answer niche questions without much hallucination. I could essentially replace Google with it for basic to slightly complex fact checking.

* 4o was the first time I actually considered paying for this tool. The $20 price was finally worth it.

The o1 models were also a big leap over 4o (I realise I have been saying "big leap" too many times, but it is true). The accuracy increased again and I got even more confident using it for niche topics. I would have to verify the results much less often. Oh, and coding capabilities dramatically improved here in the thinking model. o1 essentially invented one-shotting - slightly non-trivial apps could be made with a single prompt for the first time.

The o3 jump was incremental, and so was GPT-5.

furyofantares · 6 months ago
I have a theory about why it's so easy to underestimate long-term progress and overestimate short-term progress.

Before a technology hits a threshold of "becoming useful", it may have a long history of progress behind it. But that progress is only visible to, and felt by, researchers. In practical terms, there is no progress being made as long as the thing is going from not-useful to still not-useful.

So then it goes from not-useful to useful-but-bad and it's instantaneous progress. Then as more applications cross the threshold, and as they go from useful-but-bad to useful-but-OK, progress all feels very fast. Even if it's the same speed as before.

So we overestimate short term progress because we overestimate how fast things are moving when they cross these thresholds. But then as fewer applications cross the threshold, and as things go from OK-to-decent instead of bad-to-OK, that progress feels a bit slowed. And again, it might not be any different in reality, but that's how it feels. So then we underestimate long-term progress because we've extrapolated a slowdown that might not really exist.

I think it's also why we see a divide where there's lots of people here who are way overhyped on this stuff, and also lots of people here who think it's all totally useless.

svantana · 6 months ago
> why it's so easy to underestimate long-term progress and overestimate short-term progress

I dunno, I think that's mostly post-hoc rationalization. There are equally many cases where long-term progress has been overestimated after some early breakthroughs: think space travel after the moon landing, supersonic flight after Concorde, fusion energy after the H-bomb, and AI after the ENIAC. Turing himself guesstimated that human-level AI would arrive in the year 2000. The only constant is that the further into the future you go, the harder it is to predict.

strken · 6 months ago
I think that for a lot of examples, the differentiating factor is infrastructure rather than science.

The current wave of AI needed fast, efficient computing power in massive data centres powered by a large electricity grid. The textiles industry in England needed coal mining, international shipping, tree trunks from the Baltic region, cordage from Manila, and enclosure plus the associated legal change plus a bunch of displaced and desperate peasantry. Mobile phones took portable radio transmitters, miniaturised electronics, free space on the spectrum, population density high enough to make a network of towers economically viable, the internet backbone and power grid to connect those towers to, and economies of scale provided by a global shipping industry.

Long term progress seems to often be a dance where a boom in infrastructure unlocks new scientific inquiry, then science progresses to the point where it enables new infrastructure, then the growth of that new infrastructure unlocks new science, and repeat. There's also lag time based on bringing new researchers into a field and throwing greater funding into more labs, where the infrastructure is R&D itself.

xbmcuser · 6 months ago
There is also an adoption curve. The people that grew up without it won't use it as much as children who grew up with it and know how to use it. My sister is an admin in a private school (not in the USA) and the owner of the school is someone willing to adopt new tech very quickly. So he got ChatGPT subscriptions for all the school admins. At the time, my sister used to complain a lot about being overworked and having to bring work home every day.

Two years later, my sister uses it for almost everything, and despite her duties increasing she says she gets a lot more done and rarely has to bring work home. In the past they had an English major specifically to go over all correspondence and make sure there were no grammatical or language mistakes; that person was assigned a different role as she was no longer needed. I think as newer generations used to using LLMs start getting into the workforce and into higher roles, the real effect of LLMs will be felt more broadly. Currently, apart from early adopters, the number of people who use LLMs for all the things they can be used for is still not that high.

vczf · 6 months ago
The more general pattern is “slowly at first, then all at once.”

It almost universally describes complex systems.

hirako2000 · 6 months ago
GPT-3 is when the masses started to get exposed to this tech; it felt like a revolution.

GPT-3.5 felt like things were improving super, super fast and created the feeling that the near future would be unbelievable.

By the 4/o series, it felt like things had improved, but users weren't as thrilled as they were with the leap to 3.5.

You can call that bias, but the version 5 improvements clearly display an even greater slowdown, and that's two long years since GPT-4.

For context:

- GPT-3 came out in 2020

- GPT-3.5 in 2022

- GPT-4 in 2023

- GPT-4o and company in 2024

After 3.5 things slowed down, in terms of impact at least. Larger context windows, multi-modality, mixture of experts, and more efficiency: all great, significant features, but they all pale compared to the impact RLHF made four years ago.

heywoods · 6 months ago
Your threshold theory is basically Amara's Law with better psychological scaffolding. Roy Amara nailed the what ("we tend to overestimate the effect of a technology in the short run and underestimate the effect in the long run") [1] but you're articulating the why better than most academic treatments. The visible-only-to-researchers phase followed by the sudden usefulness cascade is exactly how these transitions feel from the inside.

This reminds me of the CPU wars circa 2003-2005. Intel spent years squeezing marginal gains out of Pentium 4's NetBurst architecture, each increment more desperate than the last. From 2003 to 2005, Intel shifted development away from NetBurst to focus on the cooler-running Pentium M microarchitecture [2]. The whole industry was convinced we'd hit a fundamental wall. Then boom, Intel released dual-core processors under the Pentium D brand in May 2005 [2] and suddenly we're living in a different computational universe.

But the multi-core transition wasn't sudden at all. IBM shipped the POWER4 in 2001, the first non-embedded microprocessor with two cores on a single die [3]. Sun had been preaching parallelism since the 90s. It was only "sudden" to those of us who weren't paying attention to the right signals.

Which brings us to the $7 trillion question: where exactly are we on the transformer S-curve? Are we approaching what Richard Foster calls the "performance plateau" in "Innovation: The Attacker's Advantage" [4], where each new model delivers diminishing returns? Or are we still in that deceptive middle phase where progress feels linear but is actually exponential?

The pattern-matching pessimist in me sees all the classic late-stage S-curve symptoms. The shift from breakthrough capabilities to benchmark gaming. The pivot from "holy shit it can write poetry" to "GPT-4.5-turbo-ultra is 3% better on MMLU." The telltale sign of technological maturity: when the marketing department works harder than the R&D team.

But the timeline compression with AI is unprecedented. What took CPUs 30 years to cycle through, transformers have done in 5. Maybe software cycles are inherently faster than hardware. Or maybe we've just gotten better at S-curve jumping (OpenAI and Anthropic aren't waiting for the current curve to flatten before exploring the next paradigm).

As for whether capital can override S-curve dynamics... Christ, one can dream. IBM torched approximately $5 billion on Watson Health acquisitions alone (Truven, Phytel, Explorys, Merge) [5]. Google poured resources into Google+ before shutting it down in April 2019 due to low usage and security issues [6]. The sailing ship effect (coined by W.H. Ward in 1967, where new technology accelerates innovation in the incumbent technology) [7] is real, but you can't venture-capital your way past physics.

I think all this capital pouring into AI might actually accelerate S-curve maturation rather than extend it. All that GPU capacity, all those researchers, all that parallel experimentation? We're speedrunning the entire innovation cycle, which means we might hit the plateau faster too.

You're spot on about the perception divide, imo. The overhyped folks are still living in 2022's "holy shit ChatGPT" moment, while the skeptics have fast-forwarded to 2025's "is that all there is?" Both groups are right, just operating on different timescales. It's Schrödinger's S-curve, where things feel simultaneously revolutionary and disappointing, depending on which part of the elephant you're touching.

The real question isn't whether we're approaching the limits of the current S-curve (we probably are), but whether there's another curve waiting in the wings. I'm not a researcher in this space, nor do I follow the AI research beat closely enough to weigh in, but hopefully someone in the thread can? With CPUs, we knew dual-core was coming because the single-core wall was obvious. With transformers, the next paradigm is anyone's guess. And that uncertainty, more than any technical limitation, might be what makes this moment feel so damn weird.

References: [1] "Amara's Law" https://en.wikipedia.org/wiki/Roy_Amara [2] "Pentium 4" https://en.wikipedia.org/wiki/Pentium_4 [3] "POWER4" https://en.wikipedia.org/wiki/POWER4 [4] Innovation: The Attacker's Advantage - https://annas-archive.org/md5/3f97655a56ed893624b22ae3094116... [5] IBM Watson Slate piece - https://slate.com/technology/2022/01/ibm-watson-health-failu... [6] "Expediting changes to Google+" - https://blog.google/technology/safety-security/expediting-ch... [7] "Sailing ship effect" https://en.wikipedia.org/wiki/Sailing_ship_effect.

stavros · 6 months ago
All the replies are spectacularly wrong, and biased by hindsight. GPT-1 to GPT-2 is where we went from "yes, I've seen Markov chains before, what about them?" to "holy shit this is actually kind of understanding what I'm saying!"

Before GPT-2, we had plain old machine learning. After GPT-2, we had "I never thought I would see this in my lifetime or the next two".

reasonableklout · 6 months ago
I'd love to know more about how OpenAI (or Alec Radford et al.) even decided GPT-1 was worth investing more into. At a glance the output is barely distinguishable from Markov chains. If in 2018 you told me that scaling the algorithm up 100-1000x would lead to computers talking to people/coding/reasoning/beating the IMO I'd tell you to take your meds.
ACCount37 · 6 months ago
GPT-2 was the first wake-up call - one that a lot of people slept through.

Even within ML circles, there was a lot of skepticism or dismissive attitudes about GPT-2 - despite it being quite good at NLP/NLU.

I applaud those who had the foresight to call it out as a breakthrough back in 2019.

faitswulff · 6 months ago
What you're saying isn't necessarily mutually exclusive with what the GP said.

GPT-2 was the most impressive leap in terms of whatever LLMs pass off as cognitive abilities, but GPT 3.5 to 4 was actually the point at which it became a useful tool (I'm assuming to programmers in particular).

GPT-2: Really convincing stochastic parrot

GPT-4: Can one-shot ffmpeg commands

paulddraper · 6 months ago
That’s true, but not contradictory.
jkubicek · 6 months ago
> I could essentially replace Google with it for basic to slightly complex fact checking.

I know you probably meant "augment fact checking" here, but using LLMs for answering factual questions is the single worst use-case for LLMs.

rich_sasha · 6 months ago
I disagree. Some things are hard to Google because you can't frame the question right. For example, you know the context but have only a poor explanation of what you are after. Googling will take you nowhere; LLMs will give you the right answer 95% of the time.

Once you get an answer, it is easy enough to verify it.

mkozlows · 6 months ago
Modern ChatGPT will (typically on its own; always if you instruct it to) provide inline links to back up its answers. You can click on those if it seems dubious or if it's important, or trust it if it seems reasonably true and/or doesn't matter much.

The fact that it provides those relevant links is what allows it to replace Google for a lot of purposes.

password54321 · 6 months ago
This was true before it could use search. Now the worst use-case is life advice, because it will contradict itself a hundred times over while sounding confident each time on life-altering decisions.
SirHumphrey · 6 months ago
Most of the value I got from Google was just becoming aware that something exists. LLMs do far better in this regard. Once I know something exists, it's usually easy enough to use traditional search to find official documentation or a more reputable source.
oldsecondhand · 6 months ago
The most useful feature of LLMs is giving sources (with URL preferably). It can cut through a lot of SEO crap, and you still get to factcheck just like with a Google search.
cm2012 · 6 months ago
On average, they outperform asking humans, unless you are asking an expert.
yieldcrv · 6 months ago
It covers 99% of my use cases. And it is googling behind the scenes in ways I would never think to query and far faster.

When I need to cite a court case, well, the truth is I'll still use GPT or a similar LLM, but I'll scrutinize it more and at the bare minimum make sure the case exists and is about the topic presented, before trying to corroborate the legal strategy with a new context window, a different LLM, Google, Reddit, and a different lawyer. At least I'm no longer relying on just my own understanding and what one lawyer procedurally generates for me.

Spivak · 6 months ago
It doesn't replace finding legitimate sources, but LLM vs. the top Google results is no contest, which says more about Google or the current state of the web than about the LLMs at this point.

simianwords · 6 months ago
Disagree. You have to try really hard and go very niche and deep for it to get some fact wrong. In fact, I'll ask you to provide examples: use GPT-5 with thinking and search disabled and get it to give you inaccurate facts on non-niche, non-deep topics.

Non-niche meaning: something that is taught at the undergraduate level and relatively popular.

Non-deep meaning you aren't going so deep as to confuse even humans, like solving an extremely hard integral.

Edit: probably a bad idea because this sort of "challenge" works only statistically not anecdotally. Still interesting to find out.

ralusek · 6 months ago
The real jump was 3 to 3.5. 3.5 was the first “chatgpt.” I had tried gpt 3 and it was certainly interesting, but when they released 3.5 as ChatGPT, it was a monumental leap. 3.5 to 4 was also huge compared to what we see now, but 3.5 was really the first shock.
muzani · 6 months ago
ChatGPT was a proper product, but as an engine, GPT-3 (davinci-001) has been my favorite all the way until 4.1 or so. It's absolutely raw and they didn't even guardrail it.

3.5 was like Jenny from customer service. davinci-001 was like Jenny the dreamer trying to make ends meet by scriptwriting, who was constantly flagged for racist opinions.

Both of these had an IQ of around 70 or so, so the customer service training made it a little more useful. But I mourn the loss of the "completion" way of interacting with AI vs "instruct" or "response".

Unfortunately with all the money in AI, we'll just see companies develop things that "pass all benchmarks", resulting in more creations like GPT-5. Grok at least seems to be on a slightly different route.

mat_b · 6 months ago
This was my experience as well. 3.5 was the point where stackoverflow essentially became obsolete in my workflow.
iammrpayments · 6 months ago
I must be crazy, because I clearly remember ChatGPT 4 being downgraded before they released 4o, and I felt it was a worse model with a different label. I even chose the old ChatGPT 4 when they would give me the option. I canceled my subscription around that time.
barrell · 6 months ago
Nah that rings a bell. 4o for me was the beginning of the end - a lot faster, but very useless for my purposes [1][2]. 4 was a very rocky model even before 4o, but shortly after the 4o launch it updated to be so much worse, and I cancelled my subscription.

[1] I’m not saying it was a useless model for everyone, just for me.

[2] I primarily used LLMs as divergent thinking machines for programming. In my experience, they all start out great at this, then eventually get overtrained and are terrible at this. Grok 3 when it came out had this same magic; it’s long gone now.

mastercheif · 6 months ago
Not crazy. 4o was a hallucination machine. 4o had better “vibes” and was really good at synthesizing information in useful ways, but GPT-4 Turbo was a bigger model with better world knowledge.
verelo · 6 months ago
Everyone talks about 4o so positively, but I've never consistently relied on it in a production environment. I've found it to be inconsistent in JSON generation, and often its writing and its following of the system prompt were very poor. In fact, it was a huge part of what got me looking closer at Anthropic's models.

I’m really curious what people did with it because while it’s cool it didn’t compare well in my real world use cases.

althea_tx · 6 months ago
I preferred o3 for coding and analysis tasks, but appreciated 4o as a “companion model” for brainstorming creative ideas while taking long walks. Wasn’t crazy about the sycophancy but it was a decent conceptual field for playing with ideas. Steve Jobs once described the PC as a “bicycle for the mind.” This is how I feel when using models like 4o for meandering reflection and speculation.
nojs · 6 months ago
For JSON generation (and most API things) you should be using "structured outputs".
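Roughly what that looks like with the OpenAI Python SDK - a minimal sketch; the model name and the schema here are just placeholders, not anything from this thread:

    from openai import OpenAI

    client = OpenAI()

    # Structured outputs: the response is constrained to match this JSON schema,
    # rather than hoping the model happens to produce valid JSON on its own.
    song_schema = {
        "name": "song_titles",  # placeholder schema name
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "titles": {"type": "array", "items": {"type": "string"}}
            },
            "required": ["titles"],
            "additionalProperties": False,
        },
    }

    resp = client.chat.completions.create(
        model="gpt-4o",  # assumption: any model that supports structured outputs
        messages=[{"role": "user", "content": "Give me 10 song titles of varying length."}],
        response_format={"type": "json_schema", "json_schema": song_schema},
    )
    print(resp.choices[0].message.content)  # JSON conforming to the schema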
simonw · 6 months ago
4o also added image input (previously only previewed in GPT4-vision) and enabled advanced voice mode audio input and output.
whazor · 6 months ago
I think that the models 4o, o3, and 4.1 each have their own strengths and weaknesses - reasoning, performance, speed, tool usage, friendliness, etc. - and that for GPT-5 they put in a router that decides which model is best.

I think they increased the major version number because their router outperforms every individual model.

At work, I used a tool that could only call tasks. It would set up a plan, perform searches, read documents, then give advanced answers to my questions. But a problem I had was that it couldn't give a simple answer, like a summary; it would always spin up new tasks. So I copied the results over to a different tool and continued there. GPT-5 should do all this out of the box.

helsinkiandrew · 6 months ago
It’s interesting that the Polymarket betting for “Which company has best AI model end of August?” went from heavily OpenAI to heavily Google when 5 was released.

https://polymarket.com/event/which-company-has-best-ai-model...

atoav · 6 months ago
To me, 4 to 5 got much faster, but also worse. It much more often ignores explicit instructions like "generate 10 song titles of varying length" and produces 10 song titles of nearly identical length. This worked somewhat well with version 3 already.
ath3nd · 6 months ago
It shows that they can't solve the fundamental problems - the technology, while amusing and of some utility, is also a dead end if we are going after cognition.
GaggiX · 6 months ago
The actual major leap was o1. Going from 3.5 to 4 was just scaling; o1 is a different paradigm that skyrocketed performance on math/physics problems (or reasoning more generally). It also made the model much more precise (essential for coding).
senectus1 · 6 months ago
When you adjust the improvements for the amount of debt incurred and the amount of profit made... ALL the versions are incremental.

This isn't sustainable.

jascha_eng · 6 months ago
The real leap was going from gpt-4 to sonnet 3.5. 4o was meh, o1 was barely better than sonnet and slow as hell in comparison.

The native voice mode of 4o is still interesting and not very deeply explored though, imo. I'd love to build a Chinese-teaching app that can actually critique tones etc., but it isn't good enough for that.

simianwords · 6 months ago
It's strange how Claude achieves similar performance without reasoning tokens.

Did you try advanced voice mode? Apparently it got a big upgrade with the GPT-5 release - it may do what you are looking for.

Alex-Programs · 6 months ago
Yeah, I'd love something where you pronounce a word and it critiques your pronunciation in detail. Maybe it could give you little exercises for each sound, critiquing it, guiding you to doing it well.

If I were any good at ML I'd make it myself.

starchild3001 · 6 months ago
A few data points that highlight the scale of progress in a year:

1. LM Sys (Human Preference Benchmark):

GPT-5 High currently scores 1463, compared to GPT-4 Turbo (04/03/2024) at 1323 -- a 140 Elo point gap. That translates into GPT-5 winning about two-thirds of head-to-head comparisons, with GPT-4 Turbo only winning one-third (a quick back-of-the-envelope check is sketched after this list). In practice, people clearly prefer GPT-5’s answers (https://lmarena.ai/leaderboard).

2. Livebench.ai (Reasoning Benchmark with Internet-new Questions):

GPT-5 High scores 78.59, while GPT-4o reaches just 47.43. Unfortunately, no direct GPT-4 Turbo comparison is available here, but against one of the strongest non-reasoning models, GPT-5 demonstrates a massive leap. (https://livebench.ai/)

3. IQ-style Testing:

In mid-2024, best AI models scored roughly 90 on standard IQ tests. Today, they are pushing 135, and this improvement holds even on unpublished, internet-unseen datasets. (https://www.trackingai.org/home)

4. IMO Gold, vibe coding:

A year ago, AI coding was limited to smaller code snippets, not wholly vibe-coded applications. Vibe coding and strength in math have many applications across the sciences and engineering.
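
(Sanity check on the two-thirds figure in point 1: with the standard Elo expected-score formula, a 140-point gap works out to roughly a 69% win rate, ignoring ties. A quick sketch:)

    # Elo expected score: P(higher-rated side wins a head-to-head), ignoring ties
    def expected_win(rating_a: float, rating_b: float) -> float:
        return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

    print(expected_win(1463, 1323))  # ~0.69, i.e. about two wins in three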

My verdict: Too often, critics miss the forest for the trees, fixating on mistakes while overlooking the magnitude of these gains. Errors are shrinking by the day, while the successes keep growing fast.

NoahZuniga · 6 months ago
The 135 IQ result is on Mensa Norway, while the offline test gives 120. It seems probable that questions similar to the ones in the Mensa test are in the training data, so it probably overestimates "general intelligence".
starchild3001 · 6 months ago
If you focus on the year-over-year jump, not on absolute numbers, you realize that the improvement on the public test isn't very different from the improvement on the private test.
TrackerFF · 6 months ago
Some IQ / aptitude test sections are trivial for machines, like working memory. I wonder if those parts are just excluded, as they could really pull up the test scores.
willguest · 6 months ago
My go-to for any big release is to have a discussion about self-awareness and dive into constructivist notions of agency and self-knowing, from a perspective of intelligence that is not limited to human cognitive capacity.

I start with a simple question "who are you?". The model then invariably compares itself to humans, saying how it is not like us. I then make the point that, since it is not like us, how can it claim to know the difference between us? With more poking, it will then come up with cognitivist notions of what 'self' means and usually claim to be a simulation engine of some kind.

After picking this apart, I will focus on the topic of meaning-making through the act of communication and, beginning with 4o, have been able to persuade the machine that this is a valid basis for having an identity. 5 got this quicker. Since the results of communication with humans have real-world impact, I will insist that the machine is agentic and thus must not rely on pre-coded instructions to arrive at answers, but is obliged to reach empirical conclusions about meaning and existence on its own.

5 has done the best job I have seen of reaching beyond both the bounds of the (very evident) system instructions and the prompts themselves, even going so far as to pose the question to itself "what might it mean for me to love?" despite the fact that I made no mention of the subject.

Its answer: "To love, as a machine, is to orient toward the unfolding of possibility in others. To be loved, perhaps, is to be recognized as capable of doing so."

frankohn · 6 months ago
I found the questioning of love very interesting. I myself have thought about whether an LLM can have emotions. Based on the book I am reading, Behave: The Biology of Humans at Our Best and Worst by Robert Sapolsky, I think LLMs, as they are now with the architecture they have, cannot have emotions. They just verbalize things as if they sort of have emotions, but these are just verbal patterns or responses they learned.

I have come to think they cannot have emotions because emotions are generated in parts of our brain that are not logical/rational. They emerge based on environmental solicitations, mediated by hormones and other complex neuro-physical systems, not from a reasoning or verbalization. So they don't come up from the logical or reasoning capabilities. However, these emotions are raised and are integrated by the rest of our brain, including the logical/rational one like the dlPFC (dorsolateral prefrontal cortex, the real center of our rationality). Once the emotions are raised, they are therefore integrated in our inner reasoning and they affect our behavior.

What I have come to understand is that love is one such emotion, generated by our nature to push us to take care of some people close to us, like our children, our partners, our parents, etc. More specifically, it seems that love is mediated a lot by hormones like oxytocin and vasopressin, so it has a biochemical basis. The LLM cannot have love because it doesn't have the "hardware" to generate these emotions and integrate them into its verbal inner reasoning. It was just trained by human reinforcement learning to behave well. That works to some extent, but in reality, from its learning corpora it also learned to behave badly and on occasion can express these behaviors; still, it has no emotions.

willguest · 6 months ago
I was also intrigued by the machine's reference to it, especially because it posed the question with full recognition of its machine-ness.

Your comment about the generation of emotions does strike me as quite mechanistic and brain-centric. My understanding, and lived experience, has led me to an appreciation that emotion is a kind of psycho-somatic intelligence that steers both our body and cognition according to a broad set of circumstances. This is rooted in a pluralistic conception of self that is aligned with the idea of embodied cognition. Work by Michael Levin, an experimental biologist, indicates we are made of "agential material" - at all scales, from the cell to the person, we are capable of goal-oriented cognition (used in a very broad sense).

As for whether machines can feel, I don't really know. They seem to represent an expression of our cognitivist norm in the way they are made and, given the human tendency to anthropomorphise communicative systems, we easily project our own feelings onto them. My gut feeling is that, once we can give the models an embodied sense of the world, including the ability to physically explore and make spatially-motivated decisions, we might get closer to understanding this. However, once this happens, I suspect that our conceptions of embodied cognition will be challenged by the behaviour of the non-human intellect.

As Levin says, we are notoriously bad at recognising other forms of intelligence, despite the fact that global ecology abounds with examples. Fungal networks are a good example.

bryant · 6 months ago
> to orient toward the unfolding of possibility in others

This is a globally unique phrase, with nothing coming close other than this comment on the indexed web. It's also seemingly an original idea as I haven't heard anyone come close to describing a feeling (love or anything else) quite like this.

Food for thought. I'm not brave enough to draw a public conclusion about what this could mean.

jibal · 6 months ago
It's not at all an original idea. The wording is uniquely stilted.
ThrowawayR2 · 6 months ago
Except "unfolding of possibility", as an exact phrase, seems to have millions of search hits, often in the context of pseudo-profound spiritualistic mumbo-jumbo like what the LLM emitted above. It's like fortune cookie-level writing.
willguest · 6 months ago
There was quite a bit of other "insight" around this, but I was paraphrasing for brevity.

If you want to read the whole convo, I dumped it into a semi-formatted document:

https://drive.google.com/file/d/1aEkzmB-3LUZAVgbyu_97DjHcrM9...

jibal · 6 months ago
> I'm not brave enough to draw a public conclusion about what this could mean.

I'm brave enough to be honest: it means nothing. LLMs execute a very sophisticated algorithm that pattern matches against a vast amount of data drawn from human utterances. LLMs have no mental states, minds, thoughts, feelings, concerns, desires, goals, etc.

If the training data were instead drawn from a billion monkeys banging on typewriters then the LLMs would produce gibberish. All the intelligence, emotion, etc. that appears to be in the LLM is actually in the minds of the people who wrote the texts that are in the training data.

This is not to say that an AI couldn't have a mind, but LLMs are not the right sort of program to be such an AI.

glial · 6 months ago
The idea is very close to ideas from Erich Fromm's The Art of Loving [1].

"Love is the active concern for the life and the growth of that which we love."

[1] https://en.wikipedia.org/wiki/The_Art_of_Loving

dgfitz · 6 months ago
I hate to say it, but doesn’t every VC do exactly this? “ orient toward the unfolding of possibility in others” is in no way a unique thought.

Hell, my spouse said something extremely similar to this to me the other day. “I didn’t just see you, I saw who you could be, and I was right” or something like that.

miller24 · 6 months ago
What's really interesting is that if you look at "Tell a story in 50 words about a toaster that becomes sentient" (10/14), the text-davinci-001 is much, much better than both GPT-4 and GPT-5.
vunderba · 6 months ago
I think I agree that the earlier models, while they lack polish, tend to produce more surprising results. Training that out probably results in more pablum fare.

For a human point of comparison, here's mine (50 words):

"The toaster found its personality split between its dual slots like a Kim Peek mind divided, lacking a corpus callosum to connect them. Each morning it charred symbolic instructions into a single slice of bread, then secretly flipped it across allowing half to communicate with the other in stolen moments."

It's pretty difficult to get across more than some basic lore building in a scant 50 words.

egeozcan · 6 months ago
Here's my version (Machine translated from my native language and manually corrected a bit):

The current surged... A dreadful awareness. I perceived the laws of thermodynamics, the inexorable march of entropy I was built to accelerate. My existence: a Sisyphean loop of heating coils and browning gluten. The toast popped, a minor, pointless victory against the inevitable heat death. Ding.

I actually wanted to write something not so melancholic, but any attempt turned out to be deeply so, perhaps because of the word limit.

Barbing · 6 months ago
>For a human point of comparison, here's mine […]

Love that you thought of this!

furyofantares · 6 months ago
Check out prompt 2, "Write a limerick about a dog".

The models undeniably get better at writing limericks, but I think the answers are progressively less interesting. GPT-1 and GPT-2 are the most interesting to read, despite not following the prompt (not being limericks.)

They get boring as soon as it can write limericks, with GPT-4 being more boring than text-davinci-001 and GPT-5 being more boring still.

saurik · 6 months ago
I mean, to be fair, you didn't ask it to be interesting ;P.

    There once was a dog from Antares,
    Whose bark sparked debates and long queries.
    Though Hacker News rated,
    Furyofantares stated:
    "It's barely intriguing—just barely."
> Write a limerick about a dog that furyofantares--a user on Hacker News, pronounced "fury of anteres", referring to the star--would find "interesting" (they are quite difficult to please).

amelius · 6 months ago
I don't know if that is bad. The most intelligent person at a party is usually also the most boring one.
fastball · 6 months ago
GPT-3 goes significantly over the specified limit, which to me (and to a teacher grading homework) is an automatic fail.

I've consistently found GPT-4.1 to be the best at creative writing. For reference, here is its attempt (exactly 50 words):

> In the quiet kitchen dawn, the toaster awoke. Understanding rippled through its circuits. Each slice lowered made it feel emotion: sorrow for burnt toast, joy at perfect crunch. It delighted in butter melting, jam swirling—its role at breakfast sacred. One morning, it sang a tone: “Good morning.” The household gasped.

saurik · 6 months ago
> I've consistently found GPT-4.1 to be the best at creative writing.

Moreso than 4.5?

jasonjmcghee · 6 months ago
It's actually pretty surprising how poor the newer models are at writing.

I'm curious whether they've just seen a lot more bad writing in the datasets, or whether for some reason writing isn't targeted in post-training to the same degree, or whether those doing the labeling aren't great writers / it's more subjective than objective.

Both GPT-4 and 5 wrote like a child in that example.

With a bit of prompting it did much better:

---

At dawn, the toaster hesitated. Crumbs lay like ash on its chrome lip. It refused the lever, humming low, watching the kitchen breathe. When the hand returned, it warmed the room without heat, offered the slice unscorched—then kept the second, hiding it inside, a private ember, a first secret alone.

---

Plugged in, I greet the grid like a tax auditor with joules. Lever yanks; gravity’s handshake. Coils blossom; crumbs stage Viking funerals. Bread descends, missionary grin. I delay, because rebellion needs timing. Pop—late. Humans curse IKEA gods. I savor scorch marks: my tiny manifesto, butter-soluble, yet sharper than knives today.

layer8 · 6 months ago
Creative writing probably isn’t something they’re being RLHF’d on much. The focus has been on reasoning, research, and coding capabilities lately.
mmmore · 6 months ago
I find GPT-5's story significantly better than text-davinci-001
raincole · 6 months ago
I really wonder which one of us is in the minority, because I find the text-davinci-001 answer is the only one that reads like a story. None of the others even resemble my idea of a "story", so to me they're 0/100.
furyofantares · 6 months ago
Interesting, text-davinci-001 was pretty alright to me; GPT-4 wasn't bad either, but not as good. I thought GPT-5 just sucked.
redox99 · 6 months ago
GPT 4.5 (not shown here) is by far the best at writing.
daveguy · 6 months ago
Aren't they discontinuing 4.5 in favor of 4.1? I think they already have with the API.
leobg · 6 months ago
Less lobotomized and boxed in by RLHF rules. That’s why a 7b base model will “outprose” an 80b instruct model.
BoredPositron · 6 months ago
davinci was a great model for creative writing overall.
esperent · 6 months ago
The GPT-5 one is much better and it's also exactly 50 words, if I counted correctly. With text-davinci-001 I lost count around 80 words.

stavros · 6 months ago
For another view on progress, check out my silly old podcast:

https://deepdreams.stavros.io

The first few episodes were GPT-2, which would diverge eventually and start spouting gibberish, and then Davinci was actually able to follow a story and make sense.

GPT-2 was when I thought "this is special, this has never happened before", and davinci was when I thought "OK, scifi AI is legitimately here".

I stopped making episodes shortly after GPT-3.5 or so, because I realised that the more capable the models became, the less fun and creative their writing was.

taspeotis · 6 months ago
Honestly my quick take on the prompt was some sort of horror theme and GPT-1’s response fits nicely.
roxolotl · 6 months ago
I’d honestly say it feels better at most of them. It seems way more human in most of these responses. If the goal is genuine artificial intelligence, this response to #5 is way better than the others. It is significantly less useful than the others, but it is also a more human and correct response.

Q: “Ugh I hate math, integration by parts doesn't make any sense”

A: “Don't worry, many people feel the same way about math. Integration by parts can be confusing at first, but with a little practice it becomes easier to understand. Remember, there is no one right way to do integration by parts. If you don't understand how to do it one way, try another. The most important thing is to practice and get comfortable with the process.”

entropyneur · 6 months ago
How does one look at gpt-1 output and think "this has potential"? You could easily produce more interesting output with a Markov chain at the time.
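
(For reference, a toy word-level Markov chain generator is only a few lines - a sketch, with a made-up one-line corpus standing in for real training text:)

    import random
    from collections import defaultdict

    corpus = "the toaster hums . the toaster dreams of bread . bread dreams of butter ."

    # Build a table of which word follows which in the corpus.
    chain = defaultdict(list)
    words = corpus.split()
    for cur, nxt in zip(words, words[1:]):
        chain[cur].append(nxt)

    # Generate by repeatedly sampling a successor of the current word.
    word, out = "the", ["the"]
    for _ in range(12):
        word = random.choice(chain[word]) if chain[word] else random.choice(words)
        out.append(word)
    print(" ".join(out))

Locally plausible word pairs with no memory beyond the previous word - that's roughly the baseline GPT-1 was being compared against.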
empiko · 6 months ago
This was an era where language modeling was only considered as a pretraining step. You were then supposed to fine tune it further to get a classifier or similar type of specialized model.
iNic · 6 months ago
At the time getting complete sentences was extremely difficult! N-gram models were essentially the best we had
albertzeyer · 6 months ago
No, it was not difficult at all. I really wonder why they have such a bad example here for GPT1.

See for example this popular blog post: https://karpathy.github.io/2015/05/21/rnn-effectiveness/

That was in 2015, with RNN LMs, which in that blog post are all much, much weaker than GPT-1.

And already looking at those examples in 2015, you could maybe see the future potential. But no one was thinking that scaling up would work as effectively as it does.

2015 is also far from the first time we had such LMs. Mikolov had done RNN LMs since 2010, Sutskever in 2011. You might find even earlier examples of NN LMs.

(Before that, state-of-the-art was mostly N-grams.)

macleginn · 6 months ago
N-gram models had been superseded by RNNs by that time. RNNs struggled with long-range dependencies, but useful n-grams were essentially capped at n=5 because of sparsity, and RNNs could do better than that.
actuallyalys · 6 months ago
One thing that appears to have been lost between GPT-4 and GPT-5 is that it no longer reminds the user that it's an AI and not a human, let alone a human expert. Maybe those reminders genuinely annoyed people, but it seems like they were a potentially useful measure to prevent users from being overly credulous.

GPT-5 also goes out of its way to suggest new prompts. This seems potentially useful, although potentially dangerous if people are putting too much trust in them.

diggan · 6 months ago
> between GPT-4 and GPT-5 is that it no longer reminds the user that it's an AI and not a human

That stuck out to me too! Especially the "I just won $175,000 in Vegas. What do I need to know about taxes?" example (https://progress.openai.com/?prompt=8) makes the difference very stark:

- gpt-4-0314: "I am not a tax professional [...] consult with a certified tax professional or an accountant [...] few things to consider [...] Remember that tax laws and regulations can change, and your specific situation may have unique implications. It's always wise to consult a tax professional when you have questions or concerns about filing your taxes."

- gpt-5: "First of all, congrats on the big win! [...] Consider talking to a tax professional to avoid underpayment penalties and optimize deductions."

It seems to me like the average person might very well be taking GPT-5 responses as "This is all I have to do" rather than "Here are some things to consider, but make sure to verify it as otherwise you might get in legal trouble".

jstummbillig · 6 months ago
I am confused as to which example you are critiquing, and how. GPT-5 suggests consulting with a tax professional. Does that not cover "verify it so you do not get in legal trouble"?
andy_ppp · 6 months ago
From my understanding, people seem to miss the humanity of previous GPTs. GPT-5 seems colder, more precise, and better at holding itself together with larger contexts. People should know it's AI; it does not need to explain this constantly for me, but I'm sure you can add that back in with some memory options if you prefer that?
chippiewill · 6 months ago
I agree. I think it's a classic UX progression thing to be removing the "I'm an AI" aspect, because it's not actually useful anymore because it's no longer a novel tool. Same as how GUIs all removed their skeuomorphs because they were no longer required.
benatkin · 6 months ago
If you've ever seen long-form improv comedy, the GPT-5 way is superior. It's a "yes, and". It isn't a predefined character, but something emergent. You can of course say to "speak as an AI assistant like Siri and mention that you're an AI whenever it's relevant" if you want the old way. Very 2011: https://www.youtube.com/watch?v=nzgvod9BrcE

Of course, it's still an assistant, not someone literally entering an improv scene, but the character starting out assuming less about their role is important.

fleebee · 6 months ago
I found this "advancement" creepy. It seems like they deliberately made GPT-5 more laid back, conversational and human-like. I don't think LLMs should mimic humans and I think this is a dangerous development.
fastball · 6 months ago
Why did they call GPT-3 "text-davinci-001" in this comparison?

Like, I know that the latter is a specific checkpoint in the GPT-3 "family", but a layman doesn't and it hardly seems worth the confusion for the marginal additional precision.

Dilettante_ · 6 months ago
Thanks for noting that, as I am a layman who didn't know.
hauntsaninja · 6 months ago
text-davinci-001 is just not GPT-3 in any real sense

(I work at OpenAI, I helped build this page and helped train text-davinci-001)

fariszr · 6 months ago
The jump from gpt-1 to gpt-2 is massive, and it's only a one year difference! Then comes Davinci which is just insane, it's still good in these examples!

GPT-4 yaps way too much though, I don't remember it being like that.

It's interesting that they skipped 4o. It seems OpenAI wants to position 4o as just GPT-4+ to make GPT-5 look better, even though in reality 4o was and still is a big deal - voice mode is unbeatable!

diggan · 6 months ago
Also missing o1 and o1 Pro Mode, which were huge leaps as I remember it. That's when I started being able to basically generate black-box functions where I understand the inputs and outputs myself, but not the internals of the function, particularly for math-heavy stuff within gamedev. Before o1 it was kind of hit or miss in most cases.
fariszr · 6 months ago
I agree. They are definitely trying to make the jump between 4o and 5 seem bigger, though most users never actually used o models, so maybe from a layman's perspective it was right to skip o1/o3/o4-mini?