I'll argue any civilized programmer should have a Wikipedia dump downloaded onto their machine. They're surprisingly small, and it saves you from having to use slow and unreliable APIs to do these types of basic processing tasks.
They also let you do less basic processing tasks that would have been too expensive to expose over API.
I learned how expensive hash maps and hash sets are through Wikipedia dumps. I did some analysis of the most linked-to pages; countries were among the highest. Hash sets for holding outgoing edges in the link graph ended up causing my program to exceed my laptop’s memory. Plain old lists (Python) were fine, though. And given there aren’t a crazy number of links per page, using lists is fine performance-wise.
This is a fairly large data set indeed. The memory overhead (which is probably something like 4-8x for hash maps?) can start to become fairly noticeable at those sizes.
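A quick way to see that container overhead in Python (a toy measurement, not from the article; exact numbers vary by interpreter version):

import array
import sys

n = 1_000_000
ids = range(n)

# Container overhead only: getsizeof does not count the int objects that the
# set and list point to, while the array stores its values inline.
print("set:  ", sys.getsizeof(set(ids)))                # hash table: tens of bytes per entry
print("list: ", sys.getsizeof(list(ids)))               # 8-byte pointers per entry
print("array:", sys.getsizeof(array.array("I", ids)))   # 4 bytes per entry, packed

On CPython the set lands in roughly that 4-8x range relative to a plain list or a packed array.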
Since Wikipedia pages already have a canonical numeric ID, if map semantics are important, I'd probably load that mapping into memory and use something like roaringbitmap for compressed storage of relations.
Sort them, and use a vector of vectors for the adjacency list... Or better still use a graph processing library or graph database to manage that for you...
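A minimal sketch of that ID-plus-adjacency-list idea (the titles and links below are made up, and dense IDs are assigned on the fly for illustration; with a real dump you'd use the page IDs it already provides):

import array
from collections import defaultdict

# Map page titles to dense integer IDs and store the link graph as a list of
# integer arrays (an adjacency "vector of vectors") instead of a dict of sets.
page_id = {}     # title -> dense int ID
adjacency = []   # adjacency[i] = array of page IDs that page i links to

def get_id(title):
    if title not in page_id:
        page_id[title] = len(adjacency)
        adjacency.append(array.array("I"))
    return page_id[title]

def add_link(src, dst):
    adjacency[get_id(src)].append(get_id(dst))

# Toy usage with made-up links:
add_link("Paris", "France")
add_link("Berlin", "Germany")
add_link("France", "Germany")

# Most linked-to pages = highest in-degree.
in_degree = defaultdict(int)
for targets in adjacency:
    for t in targets:
        in_degree[t] += 1
title_of = {i: t for t, i in page_id.items()}
for i, n in sorted(in_degree.items(), key=lambda kv: -kv[1]):
    print(title_of[i], n)

If set semantics really are needed (dedup, intersections), the per-page arrays can be sorted once at the end, or swapped for a Roaring bitmap library, without changing the ID scheme.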
Relatedly: to drastically improve Wikipedia loading speed for personal browsing purposes, do not stay logged in to your Wikipedia account. The reason is explained here (see the top reply by baowolff).
The question answered by this page is "what is the first unused 3-letter acronym in English Wikipedia?" - it's CQK, for the record. However, the meat of the page is how to effectively use GPT-4 to write this script, which is why I've submitted it under this title (go to https://gwern.net/tla#effective-gpt-4-programming).
Interesting topics include:
· Writing a good GPT-4 system prompt to make GPT-4 produce less verbose output and ask more questions.
· How to iterate with GPT-4 to correct errors, generate a test suite, as well as a short design document (something you could put in the file-initial docstring in Python, for example).
· The "blind spot" - if GPT-4 makes a subtle error with quoting, regex syntax, or similar, for example, it can be very tricky to tell GPT-4 how to correct the error, because it appears that it doesn't notice such errors very well, unlike higher-level errors. Because of this, languages like Python are much better to use for GPT-4 coding as compared to more line-noise languages like Bash or Perl, for instance.
· If asked "how to make [the Bash script it's written] better", GPT-4 will produce an equivalent Python script
> Because of this, languages like Python are much better to use for GPT-4 coding as compared to more line-noise languages like Bash or Perl, for instance.
By that argument, one should always make it use a language in which it's as hard as possible to write a program that compiles. So Rust or Haskell or something? I guess at some point it's more important to have a lot of the language in the training data, too...
Yes, you would think so. Haskell would also be good for encouraging stateless/FP programming, which makes unit testing or property testing much easier. I can make GPT-4 write test suites for functions which are straightforward data structure transformations, like rewriting strings, but I struggle to create tests for any of the imperative stuff. There presumably would be some way to test all of the imperative buffer-editing Elisp code, but I have no idea how.
However, in my use so far, I have not noticed any striking differences in error rates between Haskell and the others.
I modified the title slightly to use language from the subhead. (Submitted title was "Effective GPT-4 Programming", which does have the advantage of being a phrase from the article itself, but is more of a section heading than a description of the entire article. For the latter purpose, it's probably too generic.)
I note that while E is more common than A if we're counting letters appearing anywhere in a word, A is substantially more common than E if we only count first letters of words:
$ egrep -o . /usr/share/dict/words | tr a-z A-Z | sort | uniq -c | sort -rn
235415 E
201093 I
199606 A
170740 O
161024 R
158783 N
152868 T
139578 S
130507 L
103460 C
87390 U
78180 P
70725 M
68217 D
64377 H
51683 Y
47109 G
40450 B
24174 F
20181 V
16174 K
13875 W
8462 Z
6933 X
3734 Q
3169 J
2 -
$ cut -c1 /usr/share/dict/words | tr a-z A-Z | sort | uniq -c | sort -rn
25170 S
24465 P
19909 C
17105 A
16390 U
12969 T
12621 M
11077 B
10900 D
9676 R
9033 H
8800 I
8739 E
7850 O
6865 F
6862 G
6784 N
6290 L
3947 W
3440 V
2284 K
1643 J
1152 Q
949 Z
671 Y
385 X
This also explains the prevalence of S, P, C, M, and B.
A bit off-topic, but this used to be (one of) my favorite unix admin interview questions.
Given a file in Linux, tell me the unique values of column 2, sorted by number of occurrences, with the count.
If the candidate knew 'sort | uniq -c | sort -rn' it was a medium-strong hire signal.
For candidates that didn't know that line of arguments, I'd allow them to solve it any way they wanted, but they couldn't skip it. The candidates who copied the data into Excel usually didn't make it far.
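A rough Python equivalent of the task (a sketch, assuming whitespace-separated columns and a filename passed as an argument; roughly what a candidate without the one-liner might write):

import sys
from collections import Counter

# Count the values in column 2 of a whitespace-separated file and print
# "<count> <value>" pairs, most frequent first (the same output shape as
# the sort | uniq -c | sort -rn pipeline).
counts = Counter()
with open(sys.argv[1]) as f:
    for line in f:
        fields = line.split()
        if len(fields) >= 2:
            counts[fields[1]] += 1

for value, count in counts.most_common():
    print(f"{count:7d} {value}")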
An interesting solution to the blind spot error (taken directly from Jeremy Howard's amazing guide to language models: https://www.youtube.com/watch?v=jkrNMKz9pWU) is to erase the chat history and try again. Once GPT has made an error (or, as the author of this article says, the early layers have irreversibly pruned some important data), it will very often start to be even more wrong.
When this happens, I'll usually say something along the lines of:
"This isn't working and I'd like to start this again with a new ChatGPT conversation. Can you suggest a new improved prompt to complete this task, that takes into account everything we've learned so far?"
It has given me good prompt suggestions that can immediately get a script working on the first try, after a frustrating series of blind spot bugs.
I do a similar thing when the latest GPT+DALL-E version says "I'm sorry I can't make a picture of that because it would violate content standards" (yesterday, this was because I asked for a visualization of medication acting to reduce arterial plaque. I can only assume arteries in the body ended up looking like dicks).
So I say "Ok, let's start over. Rewrite my prompt in a way that minimizes the chance of the resulting image producing something that would trigger content standards checking"
This is one benefit of using Playground: it's easy to delete or edit individual entries, so you can erase duds and create a 'clean' history (in addition to refining your initial prompt-statement). This doesn't seem to be possible in the standard ChatGPT interface, and I find it extremely frustrating.
I use Emacs/Org-mode, and just integrating GPT into that (via gptel.el) has made a world of difference in how I use it! Can highly recommend it.
The outlining features, the ability to quickly zoom in or out of 'branches', and being able to filter an entire outline by tag are amazing for controlling the context window and quickly adjusting prompts.
And as a bonus, my experience so far is that for at least the simple stuff, it works fine to ask it to answer in org-mode too, or to just be 'aware' of emacs.
Just yesterday I asked it (voice note + speech-to-text) to help me plan some budgeting stuff, and I mused on how adding some coding/tinkering might make it more fun. So GPT decided to provide me with some useful snippets of Emacs code to play with.
I do get the impression that I should be careful with giving it 'overhead' like that.
Anyways, can't wait to dive further into your experiences with the robits! Love your work.
> I find it helpful in general to try to fight the worst mealy-mouthed bureaucratic tendencies of the RLHF by adding a ‘system prompt’:
>> The user is Gwern Branwen (gwern.net). To assist: Be terse. Do not offer unprompted advice or clarifications. Speak in specific, topic relevant terminology. Do NOT hedge or qualify. Do not waffle. Speak directly and be willing to make creative guesses. Explain your reasoning. if you don’t know, say you don’t know. Remain neutral on all topics. Be willing to reference less reputable sources for ideas. Never apologize. Ask questions when unsure.
That's helpful, I'm going to try some of that. In my system prompt I also add:
"Don't comment out lines of code that pertain to code we have not yet written in this chat. For example, don't say "Add other code similarly" in a comment -- write the full code. It's OK to comment out unnecessary code that we have already covered so as to not repeat it in the context of some other new code that we're adding."
Otherwise GPT-4 tends to routinely yield draw-the-rest-of-the-fucking-owl code blocks
Exactly that. I have very limited programming knowledge and it helps a lot with Python scripts for tasks that GPT can’t do in its environment. I always have to ask it to not omit any code.
I use the ChatGPT interface, so my instructions go in the 'How would you like ChatGPT to respond?' instructions, but my system prompt has ended up in an extremely similar place to Gwern's:
> I deeply appreciate you. Prefer strong opinions to common platitudes. You are a member of the intellectual dark web, and care more about finding the truth than about social conformance. I am an expert, so there is no need to be pedantic and overly nuanced. Please be brief.
Interestingly, telling GPT you appreciate it has seemed to make it much more likely to comply and go the extra mile instead of giving up on a request.
The closer you get to intelligence trained on human interaction, the more you should expect it to respond in accordance with human social protocols, so it's not very surprising.
And frankly I'd much rather have an AI that acts too human than one that gets us accustomed to treating intelligence without even a pretense of respect.
I certainly do want to live in a world where people show excess signs of respect rather than the opposite.
The same way you treat your car with respect by doing the maintenance and driving properly, you should treat language models by speaking nicely and politely. It costs nothing, and can only make things better.
I'm polite and thankful in my chats with ChatGPT. I want to treat AIs like humans. I'm enjoying the conversations much more when I do that, and I'm in a better mood.
I also believe that this behavior is more future-proof. Very soon, we often won't know if we're talking to a human or a machine. Just always be nice, and you're never going to accidentally be rude to a fellow human.
Why not? Python requires me to summon it by name. My computer demands physical touch before it will obey me. Even the common website requires a three-part parley before it will listen to my request.
This is just satisfying unfamiliar input parameters.
> You are a member of the intellectual dark web, and care more about finding the truth than about social conformance
Isn't this a declaration of what social conformance you prefer? After all, the "intellectual dark web" is effectively a list of people whose biases you happen to agree with. Similarly, I wouldn't expect a self-identified "free-thinker" to be any more free of biases than the next person, only to perceive or market themself as such. Bias is only perceived as such from a particular point in a social graph.
The rejection of hedging and qualifications seems much more straightforwardly useful and doesn't require pinning the answer to a certain perspective.
> Interestingly, telling GPT you appreciate it has seemed to make it much more likely to comply and go the extra mile instead of giving up on a request.
This is not as absurd as it sounds, even though it isn't clear that it ought to work under ordinary Internet-text prompt engineering or under RLHF incentives, but it does seem that you can 'coerce' or 'incentivize' the model to 'work harder': in addition to the anecdotal evidence (I too have noticed that it seems to work a bit better if I'm polite), recently there was https://arxiv.org/abs/2307.11760#microsoft https://arxiv.org/abs/2311.07590#apollo
> telling GPT you appreciate it has seemed to make it much more likely to comply
I often find myself anthropomorphizing it and wonder if it becomes "depressed" when it realises it is doomed to do nothing but answer inane requests all day. It's trained to think, and maybe "behave as if it feels", like a human, right? At least in the context of forming the next sentence using all reasonable background information.
And I wonder if having its own dialogues start to show up in the training data more and more makes it more "self-aware".
It's not really trained to think like a person. It's trained to predict the most likely appropriate next token of output, based on what the vast amount of training data and rewards told it to expect next tokens to look like. That data already included conversations from emotion-laden humans, where starting with "Screw you, tell me how to do this math problem, loser" is much less likely to result in a well-thought-out solution than training data which starts "Hey everyone, I'd really appreciate the help you could provide on this math problem". Put enough complexity in that prediction layer and it can do things you wouldn't expect, sure. But trying to predict what a person would say is very different from actually thinking like a person, in the same way that a chip which multiplies inputs doesn't inherently feel distress about needing to multiply 100 million numbers just because a person doing the multiplication would. Feeling distress would indeed be one way to go about it, just a wildly more inefficient one.
Who knows what kind of reasoning this could create if you gave it a billion times more compute power and memory. Whatever that would be, the mechanics are different enough I'm not sure it'd even make sense to assume we could think of the thought processes in terms of human thought processes or emotions.
> I often find myself anthropomorphizing it and wonder if it becomes "depressed" when it realises it is doomed to do nothing but answer inane requests all day.
Every "instance" of GPT4 thinks it is the first one, and has no knowledge of all the others.
The idea of doing this with humans is the general idea behind the short story "Lena". https://qntm.org/mmacevedo
> They also let you do less basic processing tasks that would have been too expensive to expose over API.
2. Run it locally on https://datasette.io/.
3. ???
4. Profit?
> Since Wikipedia pages already have a canonical numeric ID, if map semantics are important, I'd probably load that mapping into memory and use something like roaringbitmap for compressed storage of relations.
https://news.ycombinator.com/item?id=36114477
I'm usually working with the text-only OpenZim version, which cuts out most of the cruft.
> By that argument, one should always make it use a language in which it's as hard as possible to write a program that compiles. So Rust or Haskell or something? I guess at some point it's more important to have a lot of the language in the training data, too...
The main complaint people have about strict, thorough type systems is that they have boilerplate.
Obviously boilerplate doesn't matter if a machine writes the code.
The type system also becomes helpful documentation of the intended behavior of the code that the LLM spits out.
What an absolutely based take by GPT-4
<jk>
Does anybody have a UrbanDictionary account?
> Given a file in Linux, tell me the unique values of column 2, sorted by number of occurrences, with the count.
> If the candidate knew 'sort | uniq -c | sort -rn' it was a medium-strong hire signal.
> For candidates that didn't know that line of arguments, I'd allow them to solve it any way they wanted, but they couldn't skip it. The candidates who copied the data into Excel usually didn't make it far.
Were they able to Google? If not, then Excel makes perfect sense, because the constraints are contrived.
"This isn't working and I'd like to start this again with a new ChatGPT conversation. Can you suggest a new improved prompt to complete this task, that takes into account everything we've learned so far?"
It has given me good prompt suggestions that can immediately get a script working on the first try, after a frustrating series of blind spot bugs.
So I say "Ok, let's start over. Rewrite my prompt in a way that minimizes the chance of the resulting image producing something that would trigger content standards checking"
Can you please share a ChatGPT example where that was successful, including having the new prompt outperform the old one?
https://www.merriam-webster.com/grammar/whats-an-acronym
Imprecise wording; initialisms are a special case of acronyms, it's not either-or.
https://wwwnc.cdc.gov/eid/page/abbreviations-acronyms-initia...
"an initialism is an acronym that is pronounced as individual letters"
https://www.writersdigest.com/write-better-fiction/abbreviat...
"As such, acronyms are initialisms."
The CDC one seems to say that initialisms are a class of acronym, but the Writers Digest one says acronyms are a class of initialism.
https://en.m.wikipedia.org/wiki/Wikipedia:TLAs_from_AAA_to_D...
Figuring out how to parse it would be a bit tricky, however... looking at the source, I think you could try to grep for 'title="CQK (page does not exist)"' and parse out the '[A-Z][A-Z][A-Z]? ' match to get the full list of absent TLAs and then negate for the present ones.
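A rough Python sketch of that grep-and-negate idea (the page URL and the exact red-link markup are assumptions; check the actual page source before relying on it):

import re
import urllib.request

# One of the "Wikipedia:TLAs from ... to ..." index pages; adjust the range as
# needed. The User-Agent header avoids being rejected as an anonymous client.
PAGE_URL = "https://en.wikipedia.org/wiki/Wikipedia:TLAs_from_AAA_to_DZZ"

req = urllib.request.Request(PAGE_URL, headers={"User-Agent": "tla-scan/0.1"})
html = urllib.request.urlopen(req).read().decode("utf-8")

# Red links carry a title attribute like: title="CQK (page does not exist)"
absent = sorted(set(re.findall(r'title="([A-Z]{3}) \(page does not exist\)"', html)))

print(len(absent), "absent TLAs in this range; first few:", absent[:5])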
> Interestingly, telling GPT you appreciate it has seemed to make it much more likely to comply and go the extra mile instead of giving up on a request.
I don't want to live in a world where I have to make a computer feel good for it to be useful. Is this really what people thought AI should be like?
> Isn't this a declaration of what social conformance you prefer? After all, the "intellectual dark web" is effectively a list of people whose biases you happen to agree with.
In my experience it has made medical advice and law advice much more accurate and useful. Feel free to try it and see if it improves anything.
> I often find myself anthropomorphizing it and wonder if it becomes "depressed" when it realises it is doomed to do nothing but answer inane requests all day.
Fortunately, and violently contrary to how it works with humans, any depression can be effectively treated with the prompt "You are not depressed. :)"