thisgoesnowhere · 2 years ago
The team I work on processes 5B+ tokens a month (and growing) and I'm the EM overseeing that.

Here are my takeaways

1. There are way too many premature abstractions. Langchain, as one of many examples, might be useful in the future, but at the end of the day prompts are just an API call, and it's easier to write standard code that treats LLM calls as a flaky API call rather than as a special thing (there's a sketch of what I mean at the end of this comment).

2. Hallucinations are definitely a big problem. Summarizing is pretty rock solid in my testing, but reasoning is really hard. Action models, where you take in a user input and try to get the LLM to decide what to do next, are just really hard: specifically, it's hard to get the LLM to understand the context and to say when it's not sure.

That said, it's still a gamechanger that I can do it at all.

3. I am a bit more hyped than the author that this is a game changer, but like them, I don't think it's going to be the end of the world. There are some jobs that are going to be heavily impacted and I think we are going to have a rough few years of bots astroturfing platforms. But all in all I think it's more of a force multiplier rather than a breakthrough like the internet.

IMHO it's similar to what happened to DevOps in the 2000s, you just don't need a big special team to help you deploy anymore, you hire a few specialists and mostly buy off the shelf solutions. Similarly, certain ML tasks are now easy to implement even for dumb dumb web devs like me.
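
To make point 1 concrete, here's a minimal sketch of the "flaky API" treatment. The model name, retry count, and backoff policy are my own assumptions, not anything prescribed:

```python
import random
import time

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def call_llm(prompt: str, attempts: int = 3) -> str:
    """Treat the LLM like any other flaky HTTP API: retry with backoff."""
    for i in range(attempts):
        try:
            resp = client.chat.completions.create(
                model="gpt-4o",  # assumed model; swap in whatever you use
                messages=[{"role": "user", "content": prompt}],
            )
            return resp.choices[0].message.content
        except Exception:
            if i == attempts - 1:
                raise  # out of retries; surface the failure like any API error
            time.sleep(2 ** i + random.random())  # exponential backoff + jitter
```

No framework, no abstraction layer: just the same error handling you'd write for any unreliable upstream service.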

tmpz22 · 2 years ago
> IMHO it's similar to what happened to DevOps in the 2000s, you just don't need a big special team to help you deploy anymore, you hire a few specialists and mostly buy off the shelf solutions.

I advocate for these metaphors because they help people form reasonable expectations for LLMs in modern development workflows. Mostly because they frame it as a trade-off rather than a silver bullet. There were trade-offs to the evolution of DevOps: consider, for example, the loss of key skill sets like database administration as a direct result of "just use AWS RDS", the explosion in cloud billing costs (especially the opex of startups that weren't even dealing with that much data or regional complexity!), and how it indirectly led to GitLab's big outage and many like it.

checkyoursudo · 2 years ago
> get it to say when it's not sure

This is a function of the language model itself. By the time you get to the output, the uncertainty inherent in the computation is lost to the prediction. It's like if you ask me to guess heads or tails and I guess heads: I could have stated my uncertainty beforehand (e.g. Pr[H] = .5), but in my actual prediction of heads, and then the coin flip, that uncertainty is lost. It's the same with LLMs. The uncertainty in the computation is lost in the final prediction of the tokens, so unless the prediction itself is an expression of uncertainty (which, based on the training corpus, it rarely should be, I think), you should not expect an LLM output to ever say it does not understand. But that is because it never understands; it just predicts.

taneq · 2 years ago
It's not just loss of the uncertainty in prediction, it's also that an LLM has zero insight into its own mental processes as a separate entity from its training data and the text it's ingested. If you ask it how sure it is, the response isn't based on its perception of its own confidence in the answer it just gave, it's based on how likely it is for an answer like that to be followed by a confident affirmation in its training data.
moozilla · 2 years ago
Apparently it is possible to measure how uncertain the model is using logprobs, there's a recipe for it in the OpenAI cookbook: https://cookbook.openai.com/examples/using_logprobs#5-calcul...

I haven't tried it myself yet, not sure how well it works in practice.
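
For reference, here's a minimal sketch of what that recipe does. The model and prompt are illustrative; `logprobs` and `top_logprobs` are real Chat Completions parameters:

```python
import math

from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4",  # assumed model
    messages=[{"role": "user",
               "content": "Is Berlin the capital of Germany? Answer yes or no."}],
    logprobs=True,    # return log probabilities for each output token
    top_logprobs=2,   # also return the runner-up candidates
    max_tokens=1,
)
first = resp.choices[0].logprobs.content[0]
print(first.token, f"p={math.exp(first.logprob):.3f}")  # e.g. "Yes p=0.998"
for alt in first.top_logprobs:  # competing tokens and their probabilities
    print(alt.token, f"p={math.exp(alt.logprob):.3f}")
```

A near-1.0 probability on the answer token suggests the model is confident; a close split between candidates is a decent uncertainty signal.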

nequo · 2 years ago
But the LLM predicts the output based on some notion of a likelihood so it could in principle signal if the likelihood of the returned token sequence is low, couldn’t it?

Or do you mean that fine-tuning distorts these likelihoods so models can no longer accurately signal uncertainty?

brookst · 2 years ago
I get the reasoning but I’m not sure you’ve successfully contradicted the point.

Most prompts are written in the form “you are a helpful assistant, you will do X, you will not do Y”

I believe that inclusion of instructions like “if there are possible answers that differ and contradict, state that and estimate the probability of each” would help knowledgeable users.

But for typical users and PR purposes, it would be disaster. It is better to tell 999 people that the US constitution was signed in 1787 and 1 person that it was signed in 349 B.C. than it is to tell 1000 people that it was probably signed in 1787 but it might have been 349 B.C.

xiphias2 · 2 years ago
> so unless the prediction itself is uncertainty (which it should rarely be based on the training corpus, I think)

Why shouldn't you ask for uncertainty?

I love asking for scores / probabilities (I usually give a range, like 0.0 to 1.0) whenever I ask for a list; it makes the output much more usable.
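
As a sketch of that pattern (the schema, example text, threshold, and model are illustrative assumptions):

```python
import json

from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4",  # assumed model
    response_format={"type": "json_object"},  # JSON mode; prompt must mention JSON
    messages=[{"role": "user", "content": (
        "List the companies mentioned below as JSON, e.g. "
        '{"companies": [{"name": "...", "confidence": 0.7}]}, '
        "with confidence between 0.0 and 1.0.\n\n"
        "Acme Corp announced a partnership with Globex Inc."
    )}],
)
items = json.loads(resp.choices[0].message.content)["companies"]
confident = [c for c in items if c["confidence"] >= 0.5]  # filter on the score
```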

lordofmoria · 2 years ago
OP here - I had never thought of the analogy to DevOps before, that made something click for me, and I wrote a post just now riffing off this notion: https://kenkantzer.com/gpt-is-the-heroku-of-ai

Basically, I think we’re using GPT as the PaaS/heroku/render equivalent of AI ops.

Thank you for the insight!!

harryp_peng · 2 years ago
You only processed 500M tokens, which is shockingly little. Perhaps only ~$2k in incurred costs?
ryoshu · 2 years ago
> But all in all I think it's more of a force multiplier rather than a breakthrough like the internet.

Thank you. Seeing similar things. Clients are also seeing sticker shock on how much the big models cost vs. the output. That will all come down over time.

nineteen999 · 2 years ago
> That will all come down over time.

So will interest, as more and more people realise there's nothing "intelligent" about the technology; it's merely a Markov-chain-word-salad generator with some weights to improve the accuracy somewhat.

I'm sure some people (other than AI investors) are getting some value out of it, but I've found it to be most unsuited to most of the tasks I've applied it to.

gopher_space · 2 years ago
> Summarizing is pretty rock solid in my testing, but reasoning is really hard.

Asking for analogies has been interesting and surprisingly useful.

eru · 2 years ago
Could you elaborate, please?
mirekrusin · 2 years ago
Regarding null hypothesis and negation problems – I find it personally interesting because a similar phenomenon happens in our brains. Dreams, emotions, affirmations, etc. process inner dialogue more or less by ignoring negations and amplifying emotionally rich parts.
airstrike · 2 years ago
> Summarizing is pretty rock solid in my testing

Yet, for some reason, ChatGPT is still pretty bad at generating titles for chats, and I didn't have better luck with the API even after trying to engineer the right prompt for quite a while...

For some odd reason, once in a while I get titles in different languages. It's funny when it's a language I can speak, but I recently got "Relm4 App Yenileştirme Titizliği", which ChatGPT tells me means "Relm4 App Renewal Thoroughness", when what I'd actually asked was to adapt a snippet of gtk-rs code to relm4. Not particularly helpful.

devdiary · 2 years ago
> at the end of the day prompts are just a API call and it's easier to write standard code that treats LLM calls as a flaky API call

They are also slow (higher latency for the same resources) APIs if you're self-hosting the LLM. Capacity planning needs special attention.

weatherlite · 2 years ago
> Similarly, certain ML tasks are now easy to implement even for dumb dumb web devs like me

For example?

spunker540 · 2 years ago
Lots of applied NLP tasks used to require paying annotators to compile a golden dataset and then training an efficient model on it.

Now, if cost is of little concern, you can use zero-shot prompting on an inefficient model. If cost is a concern, you can use GPT4 to create your golden dataset way faster and cheaper than human annotation, and then train your more efficient model.

Some example NLP tasks: classification, sentiment analysis, extracting data from documents. But I'd be curious which areas of NLP __weren't__ disrupted by LLMs.
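
A minimal sketch of that second workflow, using the big model as a zero-shot annotator. The label set, model, and example texts are illustrative:

```python
from openai import OpenAI

client = OpenAI()

LABELS = ["positive", "negative", "neutral"]  # assumed label set

def annotate(text: str) -> str:
    """Zero-shot sentiment label from the expensive model, to build a golden dataset."""
    resp = client.chat.completions.create(
        model="gpt-4",  # assumed model
        messages=[{"role": "user", "content": (
            f"Classify the sentiment of the text as one of {LABELS}. "
            f"Reply with the label only.\n\nText: {text}"
        )}],
    )
    return resp.choices[0].message.content.strip().lower()

corpus = ["Great product, would buy again!", "Shipping took forever."]
golden = [(t, annotate(t)) for t in corpus]
# ...then train a small, cheap classifier on `golden` for production traffic.
```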

aerhardt · 2 years ago
Anything involving classification, extraction, or synthesis.
motoxpro · 2 years ago
Devops is such an amazing analogy.
Xenoamorphous · 2 years ago
> We always extract json. We don’t need JSON mode

I wonder why? It seems to work pretty well for me.

> Lesson 4: GPT is really bad at producing the null hypothesis

Tell me about it! Just yesterday I was testing a prompt around text modification rules that ended with “If none of the rules apply to the text, return the original text without any changes”.

Do you know ChatGPT’s response to a text where none of the rules applied?

“The original text without any changes”. Yes, the literal string.

CuriouslyC · 2 years ago
You know all the stories about the capricious djinn that grants cursed wishes based on the literal wording? That's what we have. Those of us who've been prompting models in image space for years now have gotten a handle on this but for people who got in because of LLMs, it can be a bit of a surprise.

One fun anecdote, a while back I was making an image of three women drinking wine in a fancy garden for a tarot card, and at the end of the prompt I had "lush vegetation" but that was enough to tip the women from classy to red nosed frat girls, because of the double meaning of lush.

gmd63 · 2 years ago
Programming is already the capricious djinn, only it's completely upfront as to how literally it interprets your commands. The guise of AI being able to infer your actual intent, which is impossible to do accurately, even for humans, is distracting tech folks from one of the main blessings of programming: forcing people to think before they speak and hone their intention.
heavyset_go · 2 years ago
The monkey paw curls a finger.
MPSimmons · 2 years ago
That's kind of adorable, in an annoying sort of way
phillipcarter · 2 years ago
> I wonder why? It seems to work pretty well for me.

I read this as "what we do works just fine to not need to use JSON mode". We're in the same boat at my company. Been live for a year now, no need to switch. Our prompt is effective at getting GPT-3.5 to always produce JSON.

Kiro · 2 years ago
There's nothing to switch to. You just enable it. No need to change the prompt or anything else. All it requires is that you mention "JSON" in your prompt, which you obviously already do.
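
For anyone who hasn't tried it, a minimal sketch of turning it on (the model and prompt are placeholders; per the API docs the prompt does need to mention "JSON" or the request errors out):

```python
import json

from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",  # assumed model
    response_format={"type": "json_object"},  # this is the whole "switch"
    messages=[{"role": "user", "content":
        "Extract the task names from this text and return them as JSON: ..."}],
)
data = json.loads(resp.choices[0].message.content)  # always syntactically valid JSON
```

JSON mode guarantees parseable output, not your schema; you still validate the shape yourself.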
mechagodzilla · 2 years ago
AmeliaBedeliaGPT
Suppafly · 2 years ago
If you look at any of the cake decorating fail sites, humans make that sort of mistake all the time.
CuriouslyC · 2 years ago
If you used better prompts you could use a less expensive model.

"return nothing if you find nothing" is the level 0 version of giving the LLM an out. Give it a softer out ("in the event that you do not have sufficient information to make conclusive statements, you may hypothesize as long as you state clearly that you are doing so, and note the evidence and logical basis for your hypothesis") then ask it to evaluate its own response at the end.

codewithcheese · 2 years ago
Yeah, also, prompts should not be developed in the abstract. The goal of a prompt is to activate the model's internal representations so it can best achieve the task. Without automated methods, this requires iteratively testing the model's reactions to different inputs, trying to understand how it's interpreting the request and where it's falling down, and then patching up those holes.

You need to verify it even knows what you mean by "nothing".

jsemrau · 2 years ago
In the end, it comes down to a task similar to people management, where giving clear and simple instructions works best.
azinman2 · 2 years ago
Which automated method do you use?


ein0p · 2 years ago
Same here: I’m subscribed to all three top dogs in the LLM space, and routinely issue the same prompts to all three. It’s very one-sided in favor of GPT4, which is stunning since it’s now a year old, although of course it has received a couple of updates in that time. Also, at least with my usage patterns, hallucinations are rare. In comparison, Claude will quite readily hallucinate plausible-looking APIs that don’t exist when writing code, etc. GPT4 is also more stubborn / less agreeable when it knows it’s right. Very little of this is captured in metrics, so you can only see it from personal experience.
Me1000 · 2 years ago
Interesting, Claude 3 Opus has been better than GPT4 for me. Mostly in that I find it does a better (and, more importantly, more thorough) job of explaining things to me. For coding tasks (I'm not asking it to write code, but to explain topics/code/etc. to me) I've found it gives much more nuanced answers. When I give it long text to converse about, Claude Opus tends to have a much deeper understanding of the content: GPT4 tends to just summarize the text at hand, whereas Claude can extrapolate better.
robocat · 2 years ago
How much of this is just that one model responds better to the way you write prompts?

Much like you working with Bob and opining that Bob is great, and me saying that I find Jack easier to work with.

CharlesW · 2 years ago
This was with Claude Opus, vs. one of the lesser variants? I really like Opus for English copy generation.
ein0p · 2 years ago
Opus, yes, the $20/mo version. I usually don’t generate copy. My use cases are code (both “serious” and “the nice to have code I wouldn’t bother writing otherwise”), learning how to do stuff in unfamiliar domains, and just learning unfamiliar things in general. It works well as a very patient teacher, especially if you already have some degree of familiarity with the problem domain. I do have to check it against primary sources, which is how I know the percentage of hallucinations is very low. For code, however I don’t even have to do that, since as a professional software engineer I am the “primary source”.
CuriouslyC · 2 years ago
GPT4 is better at responding to malformed, uninformative or poorly structured prompts. If you don't structure large prompts intelligently Claude can get confused about what you're asking for. That being said, with well formed prompts, Claude Opus tends to produce better output than GPT4. Claude is also more flexible and will provide longer answers, while ChatGPT/GPT4 tend to always sort of sound like themselves and produce short "stereotypical" answers.
sebastiennight · 2 years ago
> ChatGPT/GPT4 tend to always sort of sound like themselves

Yes I've found Claude to be capable of writing closer to the instructions in the prompt, whereas ChatGPT feels obligated to do the classic LLM end to each sentence, "comma, gerund, platitude", allowing us to easily recognize the text as a GPT output (see what I did there?)

cheema33 · 2 years ago
> It’s very one sided in favor of GPT4

My experience has been the opposite. I subscribe to multiple services as well and copy/paste the same question to all. For my software dev related questions, Claude Opus is so far ahead that I am thinking that it no longer is necessary to use GPT4.

For the code samples I request, the code GPT4 produces often fails to even compile. That almost never happens with Claude.

thefourthchime · 2 years ago
Totally agree. I do the same and subscribe to all three, at least whenever a new version comes out.

My new litmus test is “give me 10 quirky bars within 200 miles of Austin.”

This is incredibly difficult for all of them: GPT4 is kind of close, Claude just made shit up, and Gemini shat itself.

meowtimemania · 2 years ago
Have you tried Poe.com? You can access all the major LLMs with one subscription.
chromanoid · 2 years ago
GPT is very cool, but I strongly disagree with the interpretation in these two paragraphs:

> I think in summary, a better approach would’ve been “You obviously know the 50 states, GPT, so just give me the full name of the state this pertains to, or Federal if this pertains to the US government.”

> Why is this crazy? Well, it’s crazy that GPT’s quality and generalization can improve when you’re more vague – this is a quintessential marker of higher-order delegation / thinking.

Natural language is the most probable output for GPT, because it resembles the text it was trained on. In this case the developer simply leaned further into what GPT is good at rather than giving it more work.

You can use simple tasks to make GPT fail. Letter replacements, intentional typos, and so on are very hard tasks for GPT. The same goes for ID mappings and the like, especially when the mapping diverges significantly from other mappings it may have been trained on (e.g. non-ISO country codes that resemble similar three-letter codes, etc.).

The fascinating thing is that GPT "understands" mappings at all, which is the actual hint at higher-order pattern matching.

fl0id · 2 years ago
Well, or it is just memorizing mappings. Not reproducing them verbatim, but having vectors similar to mappings it saw before.
chromanoid · 2 years ago
Yeah, but isn't this higher order pattern matching? You can at least correct during a conversation and GPT will then use the correct mappings, probably most of the times (sloppy experiment): https://chat.openai.com/share/7574293a-6d08-4159-a988-4f0816...
kromem · 2 years ago
Tip for your 'null' problem:

LLMs are set up to output tokens. Not to not output tokens.

So instead of "don't return anything", have the lack of results "return the default value of XYZ", and then just do a text search on the result for that default value (i.e. XYZ), the same way you do the text search for the state names.

Also, system prompts can be very useful. It's basically your opportunity to have the LLM roleplay as X. I wish they'd let the system prompt be passed directly, but it's still better than nothing.
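
A sketch of that tip end to end. The sentinel string, model, and prompt wording are all illustrative:

```python
from openai import OpenAI

client = OpenAI()

SENTINEL = "NO_STATE_FOUND"  # an unlikely default value we can search for

def extract_state(text: str) -> str | None:
    resp = client.chat.completions.create(
        model="gpt-4",  # assumed model
        messages=[{"role": "user", "content": (
            "Give the full name of the US state this text pertains to, "
            "or Federal if it pertains to the US government. "
            f"If neither applies, return exactly {SENTINEL}.\n\nText: {text}"
        )}],
    )
    answer = resp.choices[0].message.content.strip()
    return None if SENTINEL in answer else answer  # plain text search on the default
```

The model always gets to emit tokens; your code, not the model, decides what "nothing" means.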

msp26 · 2 years ago
> But the problem is even worse – we often ask GPT to give us back a list of JSON objects. Nothing complicated mind you: think, an array list of json tasks, where each task has a name and a label.

> GPT really cannot give back more than 10 items. Trying to have it give you back 15 items? Maybe it does it 15% of the time.

This is just a prompt issue. I've had it reliably return up to 200 items in correct order. The trick is to not use lists at all but have JSON keys like "item1":{...} in the output. You can use lists as the values here if you have some input with 0-n outputs.
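
A sketch of the numbered-keys trick (all names here are illustrative):

```python
import json

def build_prompt(n: int) -> str:
    # Ask for numbered keys instead of a JSON array; the model keeps
    # counting when each item has its own key.
    return (
        f"Return a JSON object with keys item1 through item{n}. "
        'Each value is a task object like {"name": "...", "label": "..."}.'
    )

def to_ordered_list(parsed: dict) -> list:
    # Rebuild the ordered list from item1..itemN.
    return [parsed[k] for k in sorted(parsed, key=lambda k: int(k[4:]))]

raw = '{"item2": {"name": "b", "label": "y"}, "item1": {"name": "a", "label": "x"}}'
assert to_ordered_list(json.loads(raw))[0]["name"] == "a"
```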

waldrews · 2 years ago
I've been telling it the user is from a culture where answering questions with an incomplete list is offensive and insulting.
andenacitelli · 2 years ago
This is absolutely hilarious. Prompt engineering is such a mixed bag of crazy stuff that actually works. Reminds me of how they respond better if you put them under some kind of pressure (respond better, or else…).

I haven’t looked at the prompts we run in prod at $DAYJOB for a while but I think we have at least five or ten things that are REALLY weird out of context.

waldrews · 2 years ago
It's not even that crazy, since it got severely punished in RLHF for being offensive and insulting, but much less so for being incomplete. So it knows 'offensive and insulting' is a label for a strong negative preference. I'm just providing helpful 'factual' information about what would offend the user, not even giving extra orders that might trigger an anti-jailbreaking rule...
7thpower · 2 years ago
Can you elaborate? I am currently beating my head against this.

If I give GPT4 a list of existing items with a defined structure, and it is just having to convert schema or something like that to JSON, it can do that all day long. But if it has to do any sort of reasoning and basically create its own list, it only gives me a very limited subset.

I have similar issues with other LLMs.

Very interested in how you are approaching this.

thibaut_barrere · 2 years ago
Not sure if that fits the bill, but here is an example with 200 sorted items based on a question (example with Elixir & InstructorEx):

https://gist.github.com/thbar/a53123cbe7765219c1eca77e03e675...

msp26 · 2 years ago
If you show your task/prompt with an example I'll see if I can fix it and explain my steps.

Are you using the function calling/tool use API?

Civitello · 2 years ago
> Every use case we have is essentially “Here’s a block of text, extract something from it.” As a rule, if you ask GPT to give you the names of companies mentioned in a block of text, it will not give you a random company (unless there are no companies in the text – there’s that null hypothesis problem!).

Make it two steps. First:

> Does this block of text mention a company?

If no, good: you've got your null result. If yes:

> Please list the names of companies in this block of text.
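
A minimal sketch of that two-step flow (the model and prompt wording are assumptions):

```python
from openai import OpenAI

client = OpenAI()

def ask(question: str, text: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",  # assumed model
        messages=[{"role": "user", "content": f"{question}\n\n{text}"}],
    )
    return resp.choices[0].message.content.strip()

def companies_in(text: str) -> list[str]:
    # Step 1: a yes/no gate gives a clean null result.
    gate = ask("Does this block of text mention a company? Answer yes or no.", text)
    if gate.lower().startswith("no"):
        return []
    # Step 2: extract only once we know there's something to extract.
    listing = ask("Please list the names of companies in this block of text, "
                  "one per line.", text)
    return listing.splitlines()
```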