Why do AI models use so many em-dashes?

My pet theory is similar to the training set hypothesis: em-dashes appear often in prestige publications such as The Atlantic, The New Yorker, The Economist, and a few others that are considered good writing. Being magazines, they publish a lot of articles over time, reinforcing the style. They're also the sort of thing an RLHF rater will think is good, not because of the em-dash but because the general style is polished.

One thing I wondered is whether high-prestige writing is encoded into the models; it doesn't seem far-fetched that there are various linkages inside the data that say "this kind of thing should be weighted highly."

kubb · 4 months ago

It also seems that LLMs are using them correctly — as a pause or replacement for a comma (yes, I know this is an imprecise description of when to use them).

Thanks to LLMs I learned that using the short binding dash everywhere is incorrect, and I can improve my writing because of it.

number6 · 4 months ago

Before the rise of the llms there was a post here on hn where someone explained how to use all the dashes — sadly llms took them from us

cornonthecobra · 4 months ago

This is mine as well, with the addition of books. If someone wanted to train a bot to sound more human, they would select data that is verifiably human-made.

The approachable tone of popular print media also preselects for the casual, highly-readable style I suspect users would want from a bot.

lunias · 4 months ago

I think you're correct. The first time I encountered (and recognized) an em-dash in someone's writing was in middle school, and the person that wrote it was someone that I considered to be academically superior to myself. I noticed though, that a lot of people in the same "smart kids" group would use them; almost as if they had worked together on their papers. Maybe they were just reading different material, but it definitely came across as: this will make my writing "look smart".

mailarchis · 4 months ago

pg uses em-dashes too. I found it interesting to see em-dashes in his essays from way back in the early 2000s.

tim333 · 4 months ago

That kind of fits with Altman saying they put them in because users liked them (https://www.linkedin.com/posts/curtwoodward_chatgpt-em-dash-...)

I guess in the past if you'd shown me a passage with em dashes I'd say it looks good because I associate it with the New Yorker and Economist, both of which I read. Now I'd be a bit more meh due to LLMs.

It’s a real pity to me that em-dashes are becoming so disliked for their association with AI. I have long had a personal soft spot for them because I just like them aesthetically and functionally. I prided myself on searching for and correctly using em, en, and regular dashes, had a Google docs shortcut for turning `- - -` into `—` and more recently created an Obsidian auto-replacement shortcut that turns `-em` into `—`. Guess I’ll just have to use it sparingly and keep my prose otherwise human.

jasonvorhe · 4 months ago

Don't change your behaviour because some corporations made questionable decisions.

Your readers won't care about the dashes as long as the texts read like they had human origins and you have something to tell.

keiferski · 4 months ago

Unfortunately a lot of contests etc. are anti-AI usage without having a formal system for detecting it. In practice that means anyone using a lot of em-dashes will be flagged by a reviewer as AI-likely.

whynotmakealt · 4 months ago

If I found em-dashes and other patterns like "it's not just X but Y" and all the other things we correlate with AI, I might call out the person for using it.

I don't understand the purpose of using LLMs to write articles unless someone wants to be the middleman of slop. If that's the case, I'd rather cut out the middleman and get the slop directly from the AI models: instead of pasting what ChatGPT generated, give me the prompt, and maybe the temperature and other settings if need be to make it more reproducible, though the prompt itself could be enough smh.

I am not saying you should change your writing style, but at the same time you have to understand that if someone writes like AI, chances are we are too tired to look deeply enough to find out whether it's written by AI or not. We are tired of it, so you must understand our or anybody's frustration if they call out someone's writing as AI.

For those using AI to write articles etc.: if you are passionate about something, write about it, write what you want, how you want, and you will be proud. But if you use an LLM, you will constantly be called out, and frankly, it defeats the purpose of writing.

For code, there is a debate that code is just a means to an end (which is to do stuff like scripts etc.), but writing has no such end. What is it for? More views etc.? There is no point in getting such attention, considering it would just be negative attention if I or anyone found AI writing.

Not sure why people use AI text generation for articles etc. Idk.

This is my alt, but when I first started out on HN I thought my English was fine. Then somebody pointed it out, I tried to fix my grammar, and now it's second nature to me.

I would be curious to know why people write text with AI in the first place. It doesn't make sense to me, since the other side would use their slop to counter your slop. At that point just create a tl;dr post; why stretch an article into more words than necessary? (I feel like I also write a lot of filler words / yap personally, but alright, at least you know a human is writing this.) I don't get the point of writing longer if you aren't even writing it. Is it for SEO, or is the end goal money, like all things?

krzrak · 4 months ago

I feel you... For 30+ years of my life I prided myself on writing without typos and other mistakes (without autocorrect), using lots of bullet points, dashes, and words such as "delve into" or "underscore".

Now I find myself intentionally adding typos and other msitakes, and using less sophisticated language, just to not be accused of using AI.

hdgvhicv · 4 months ago

It’s been about 30 years since prose editors like Word started underlining spelling mistakes in red. I don’t get typos when writing formal text on a keyboard. One-handed on a touch screen phone with “auto correct” causing issues is another thing, but not for published articles.

matsemann · 4 months ago

I don't mind that in a "proper" text where it's actually useful and fun to read something with a flair. But maybe it has always irked people in short form (forum comments etc.), and they've just never called it out until now? I do sometimes read something that gives me an "iamverysmart" feeling, as if the author used a thesaurus to find a synonym for half the words to sound clever, but it just makes the whole thing incomprehensible.

topaz0 · 4 months ago

The distinctiveness of LLM language comes from overuse of specific words, not from a particularly sophisticated vocabulary. Some of the words it overuses may be considered sophisticated by some people, but that's not what makes it identifiable (or what makes it grating). It's still not hard to distinguish your voice from an LLM's if you are at all thoughtful about style.

(Edit: corrected (unintentional) typo)

topaz0 · 4 months ago

Part of it is the guilt-by-association with the other bad writing habits of LLMs, but I think a lot of it is just that LLMs genuinely overuse them, and that homogeneity is grating just like it's grating when you notice a text reuses a particular noticeable word or whatever. As a fellow em-dash user, I have sometimes noticed myself overusing them too, and revised accordingly, starting well before the proliferation of this particular cancer.

So I think you can keep using em-dashes without being associated with LLMs as long as you reserve them for particularly effective/tasteful occasions.

damnesian · 4 months ago

In my mind, their rightful place is in transcribing speech, where the speaker pauses and either inserts an island idea or changes course. The comma doesn't suffice, because it bridges an initial idea with an elaboration of the same idea. But so many times in written text I see it abused or lazily employed, because the author used a sentence fragment for effect, or wanted to amp up the pause and drama when a comma or, hell, even a semicolon would have served the purpose better.

The advent of the generic AI writing style has had one good effect on my own work: making me take an unflinching look at my own laziness in writing. Now I tend to clean things up while at the same time try to inject some personality in order to NOT be dismissed as AI.

nandomrumber · 4 months ago

I agree, parentheses are not only used incorrectly in a lot of online writing, they’re also ugly.

avazhi · 4 months ago

The em dash is just one of a group of traits that make something obviously written by a bot. If you use em dashes in conjunction with good writing then nobody will give a shit.

eastbound · 4 months ago

Cmd + “-“ = –

Cmd + Shift + “-“ = —

Let’s spread the word until everyone fancy uses them, and then those who criticize text for coming from LLMs will be ridiculed by our ridiculous skills.

Etheryte · 4 months ago

That's interesting, for me those shortcuts are with option, not command. On my laptop, the first shortcut you wrote down is used to zoom out.

latexr · 4 months ago

It’s ⌥ instead of ⌘, and those exact shortcuts depend on keyboard layout. You posted the US version, but others reverse the em and en dashes.

iansteyn · 4 months ago

I tried some of these today, unfortunately it seems they’re not universal across programs.

lm28469 · 4 months ago

While you're being automated out of your dashes, people are being automated out of their jobs. Relax, you'll be OK.

Mawr · 4 months ago

Try out semicolons instead; they're never used but fun to play with too!

Xorakios · 4 months ago

Semicolons seem to separate follow-up thoughts more accurately than em-dashes, at least to my meathead. I asked Perplexity/Comet this morning which delimiter is easiest for it to process in a whole list of options, to save processing power and give the most accurate results.

Line breaks were first; semicolons were second.

(and yep, I goofed around with both those ;)

spuz · 4 months ago

According to the CEO of Medium, the reason is that their founder, Ev Williams, was a fan of typography and asked that their software automatically convert two hyphens (--) into a single em-dash. Then, since Medium was used as a source for high-quality writing, he believes AI picked up a preference for em-dashes based on this writing.

https://youtu.be/1d4JOKOpzqU?si=xXDqGEXiawLtWo5e&t=569
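
For anyone curious what that kind of auto-conversion could look like, here is a minimal sketch in Python. It is not Medium's actual code; the function name `double_hyphen_to_em_dash` and the rule of leaving runs of three or more hyphens alone are assumptions made purely for illustration.

```python
import re

# U+2014 EM DASH, written as an escape so the source stays plain ASCII.
EM_DASH = "\u2014"

# Hypothetical rule: match exactly two hyphens that are not part of a
# longer run of hyphens (so "---" and "----" are left untouched).
_DOUBLE_HYPHEN = re.compile(r"(?<!-)--(?!-)")


def double_hyphen_to_em_dash(text: str) -> str:
    """Replace each standalone '--' in text with a single em-dash."""
    return _DOUBLE_HYPHEN.sub(EM_DASH, text)
```

With this rule, `double_hyphen_to_em_dash("wait -- what")` swaps the two hyphens for U+2014, while a single hyphen as in `well-known` is left untouched. Other tools may well map two hyphens to an en-dash and three to an em-dash instead.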

hshdhdhehd · 4 months ago

If Medium was a source, why don't AI models stop halfway through their output and ask for a subscription and/or payment?

The whole interview goes into that and talks about the benefits and costs of allowing search and AI crawlers access to Medium articles.

scrollaway · 4 months ago

Give OpenAI a few more months :)

don_neufeld · 4 months ago

[Founding CTO of Medium here]

It wasn’t just Ev - I can confirm that many of us were typography nuts ;)

Marcin for example - did some really crazy stuff.

https://medium.design/crafting-link-underlines-on-medium-7c0...

nicwolff · 4 months ago

He fixed underlines on Medium 11 years ago – and someone un-fixed them since then?

steve1977 · 4 months ago

> since Medium was used as a source for high-quality writing

That explains a lot…

bazoom42 · 4 months ago

Aren’t two hyphens just a traditional way to emulate an em-dash in ASCII? I believe Word does the same.

ifh-hn · 4 months ago

I thought 2 hyphens was an en-dash and 3 was an em-dash.

dagmx · 4 months ago

That’s not just a Medium thing, lots of text systems do exactly that.

Apple has done it across their systems for ages. Microsoft did it in Word for a long time too.

It was more or less standard on any tool that was geared towards writers long before Medium was a thing.

sixhobbits · 4 months ago

I would think the most obvious explanation is that they are used as part of the watermark to help OpenAI identify text, i.e. the model isn't doing it at all but a final-pass process is adding in statistical patterns on top of what the model actually generates (along with words like 'delve' and other famous GPT signatures).

I don't have evidence that that's true, but it's what I assume and I'm surprised it's not even mentioned as a possibility.

When I studied author profiling, I built models that could, given enough text, identify specific authors just by how often they used very boring words like 'of' and 'and'. So I'm assuming that OpenAI plays around with some variables like that, which would be much harder for humans to spot, but probably uses several layers of watermarking to make it harder to strip, which results in some 'obvious' ones too.
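
A rough sketch of the function-word idea, in Python. This is not the commenter's actual model, and nothing here is known to be what OpenAI does; the word list and the names `function_word_profile` and `cosine_similarity` are hypothetical, chosen for illustration. It turns a text into relative frequencies of a few boring words and compares two such profiles.

```python
import math
import re
from collections import Counter

# Assumed, deliberately tiny list of "boring" function words.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "it")


def function_word_profile(text):
    """Return the relative frequency of each function word in text."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words) or 1  # avoid division by zero on empty input
    counts = Counter(words)
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

In practice you would need far more text per author and a much longer word list before the similarity score means anything; a real profiler would also use a proper classifier rather than a single similarity number.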

constantius · 4 months ago

Obvious watermarking that consistently gets a lot of hate from vocal minorities (devs, journalists, etc.) would probably be simply removed for the benefit of those other layers you mention.

But the idea of watermarking layers is fascinating (and they are extremely likely to exist), thanks!

xandrius · 4 months ago

Honestly the most obvious explanation is that the training set has a lot of them, not some sort of watermarking conspiracy. Occam's razor at its best.

Fricken · 4 months ago

Historically I would see far more em-dashes in capital "L" literature than I would in more casual contexts. LLMs assign more weight to literature than to things like reddit comments or Daily Mail articles.

Gigachad · 4 months ago

I think this is most of it. The most obvious sign of AI slop is a style mismatched with the medium. People are posting generated text to Reddit which reads like a school essay or a LinkedIn inspirational post, something no one did before. So even though the style is not unprecedented, it’s taken out of its original context.

numpad0 · 4 months ago

I think the more correct question is why humans don't use em dashes in the first place while LLMs do all the time. And the short answer to that is, because it's Unicode stuff.

Regular computers for human use still only support ASCII in the US or ISO-8859-1 in the EU to this day, and Unicode-reliant East Asian users turn off Unicode input modes before typing English words, leaving the Asian part mostly in pure Unicode and the alphanumeric part in pure ASCII. So Unicode-ASCII mixed text is just odd by itself. This in turn makes use of em dashes odd.

Same with emojis. LLMs generate Unicode-mapped tokens directly, so they can vocalize any characters within full Unicode ranges. Humans with keyboards(physical or touchscreen) can mostly only produce what's on them.
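
To make the ASCII vs. Unicode distinction concrete, here is a small Python sketch that lists the non-ASCII characters in a piece of text by their Unicode names. The helper name `non_ascii_names` is made up for this example; `unicodedata.name` is from the standard library.

```python
import unicodedata


def non_ascii_names(text):
    """Return the Unicode names of all characters outside the ASCII range."""
    return [
        unicodedata.name(ch, "UNKNOWN")  # fallback for unnamed code points
        for ch in text
        if ord(ch) > 127  # ASCII covers code points 0..127
    ]
```

A plain hyphen-minus (U+002D) is ASCII and does not show up in the result, whereas the en dash (U+2013) and em dash (U+2014) do.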

mrandish · 4 months ago

> real humans who like em-dashes have stopped using them out of fear of being confused with AI.

Yeah, this is me. I've always liked good type and typography. 5 or 6 years ago I added em-dash to my keyboard configs to make typing it convenient - mostly because I think it just looks nicer. But lately I don't use it much because... AI.

However, in recent weeks someone accused an HN post of mine of being from a bot, despite the fact I used a plain old hyphen and not an em-dash. There was nothing in the post which seemed AI-like except possibly that hyphen. At the time, I realized that person probably just couldn't tell a hyphen from a real em-dash. So maybe that means I have to not use any dash at all.

xg15 · 4 months ago

The "book scanning" hypothesis doesn't sound so bad — but couldn't it simply be OCR bias? I imagine it's pretty easy for OCR software to misrecognize hyphens or other kinds of dashes as em-dashes if the only distinction is some subtle differences in line length.

flowerthoughts · 4 months ago

You'd think context-less OCR would prefer interpreting it as a simple hyphen, since that's the most common dash. Seems unlikely any bias would go the other way.