Nice idea, but a naive implementation, which leads to output that is unconvincing as hypothetical English words. I had a brief look, and it seems to be proportionally selecting and sticking together sequences of letters sampled from English words (lib/word-probability.ts). This doesn't take into account syllable boundaries, the way the English spelling system maps between phones/phonemes, or the phonotactic properties of English, which is why the output looks unconvincing.
A better approach would be to use a Markov chain built by sampling English text letter by letter. An even better approach would be to build your stats from some source of English words in IPA transcription with syllable boundaries etc. marked, then map from IPA to spelling via some kind of lookup table. We use a similar process in reverse in my research group for building datasets for Bayesian phylogenies of language families.
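The letter-by-letter Markov chain idea can be sketched in a few lines. This is a minimal illustration, not the site's actual implementation; the `WORDS` list is a tiny placeholder standing in for a real English word list, and in practice you'd train on tens of thousands of words:

```python
import random
from collections import defaultdict

# Placeholder training data; swap in a real English word list.
WORDS = ["apple", "banana", "cherry", "grape", "orange", "melon", "lemon", "peach"]

def build_chain(words):
    """Count letter-to-letter transitions, with '^' and '$' as word boundaries."""
    chain = defaultdict(list)
    for word in words:
        padded = "^" + word + "$"
        for a, b in zip(padded, padded[1:]):
            chain[a].append(b)
    return chain

def generate(chain, max_len=12):
    """Walk the chain from the start marker until the end marker is drawn."""
    out, cur = [], "^"
    while len(out) < max_len:
        cur = random.choice(chain[cur])  # proportional to observed counts
        if cur == "$":
            break
        out.append(cur)
    return "".join(out)

chain = build_chain(WORDS)
print(generate(chain))
```

Storing repeated successors in a list makes `random.choice` naturally sample in proportion to observed frequency, which is the whole trick.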
Clearly you are far more of a linguist than I am, but even from a lay perspective I had a similar impression; I reloaded the page several times and none of the words struck me as remotely plausible English. These are worse than most Hollywood sci-fi words/names.
A significant improvement on letter-by-letter, and not much harder, is to use n-grams: "two letters to predict the third", etc. Still not "industry grade", but the results start making more sense.
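The "two letters to predict the third" idea is just an order-2 Markov chain over letters. A minimal sketch, again with a small hypothetical word list as stand-in training data:

```python
import random
from collections import defaultdict

# Placeholder training data; a real run would use a full dictionary.
WORDS = ["through", "thought", "string", "strong", "bright", "light", "night", "fright"]

def build_ngram_chain(words, order=2):
    """Map each `order`-letter context to the letters observed after it."""
    chain = defaultdict(list)
    for word in words:
        padded = "^" * order + word + "$"
        for i in range(len(padded) - order):
            chain[padded[i : i + order]].append(padded[i + order])
    return chain

def generate(chain, order=2, max_len=12):
    """Slide the context window one letter at a time until '$' is drawn."""
    context, out = "^" * order, []
    while len(out) < max_len:
        nxt = random.choice(chain[context])
        if nxt == "$":
            break
        out.append(nxt)
        context = context[1:] + nxt
    return "".join(out)

chain = build_ngram_chain(WORDS)
print(generate(chain))
```

Raising `order` makes outputs more English-like but also more likely to just reproduce training words, so 2 or 3 is a common compromise.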
A letter-by-letter Markov chain would lead to similarly unconvincing results. As you said, groups of letters matter much more than single letters. If you know anything about Korean, its writing system actually groups letters into syllable blocks that way. If one could build such a Markov chain for English, it would be very convincing, I think.
You should check out the VOLT paper, I think it would work well. It's a new technique for splitting up a vocabulary into subwords while minimizing entropy. These subwords could then be mixed and matched, maybe by a neural model, for better results.
I'm sure some people don't hear it, like "the dress", but for some of us it sounds like an uncanny valley of English: close but not quite, just close enough that our brains trip over it and struggle to comprehend it because it is so near the real thing.
As well as the associations with [1], this also made me think of one of my favourite essays, "Horsehistory study and the automated discovery of new areas of thought"[2]
Sorry, after a few refreshes not a single word was anything that looked remotely like English. It all looked like complete gibberish or words in another language. Most of them weren’t even pronounceable.
I think ailml is the offending sequence here. It's pretty difficult to say and doesn't sound like something that you'd find in a native English word.
There's calmly which is similar, to be fair, but there's something about the tongue positions for ailml that I find noticeably more difficult, it's too far forward.
Can anyone tell me more about how this works? Most of these don't resemble English words at all to me lol, wondering what the generative procedure/parameters are in the first place
Vocabulary Learning via Optimal Transport for Neural Machine Translation - https://arxiv.org/abs/2012.15671
https://jingjing-nlp.github.io/volt-blog/
https://github.com/Jingjing-NLP/VOLT
https://www.dictionary.com/browse/minable
https://www.youtube.com/watch?v=-VsmF9m_Nt8
Tangentially related - this is how I discovered Nightwish some 15 years ago: https://www.youtube.com/watch?v=gg5_mlQOsUQ
I know this comment doesn't add anything of value to the discussion per se, but that's given me the biggest laugh I've had in months.
Nightwish came into my life in the 00s, and I couldn't tell you one song meaning, yet I love the sound.
This is just a perfect video, thank you for sharing.
"English" starts at around 0:48, but the others are also worth a listen!
https://www.youtube.com/watch?v=Vt4Dfa4fOEY
[1] https://www.thisworddoesnotexist.com/ [2] https://interconnected.org/home/2021/06/16/horsehistory
Jabberwocky
’Twas brillig, and the slithy toves
Did gyre and gimble in the wabe:
All mimsy were the borogoves,
And the mome raths outgrabe.

“Beware the Jabberwock, my son!
The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
The frumious Bandersnatch!”

He took his vorpal sword in hand;
Long time the manxome foe he sought
So rested he by the Tumtum tree
And stood awhile in thought.

And, as in uffish thought he stood,
The Jabberwock, with eyes of flame,
Came whiffling through the tulgey wood,
And burbled as it came!

One, two! One, two! And through and through
The vorpal blade went snicker-snack!
He left it dead, and with its head
He went galumphing back.

“And hast thou slain the Jabberwock?
Come to my arms, my beamish boy!
O frabjous day! Callooh! Callay!”
He chortled in his joy.

’Twas brillig, and the slithy toves
Did gyre and gimble in the wabe:
All mimsy were the borogoves,
And the mome raths outgrabe.

</obligatory> Searching many years of /. posts and other results might find you a readable version.
I like these, especially the last.
http://www.thisworddoesnotexist.com/
It also fakes the definition.
But if you want to write some Vogon like poetry, the words generated by Fakelish might be just fine.
dyn·o·derma
a slender, membranous musclelike structure, believed to represent a cross between a cranium and the external spaces of fish and invertebrates, supporting the glans in most vertebrates
"a dynoderma is thought to have existed in all living organisms"
Basically a big probability map. I'm guessing this was machine generated though, and it isn't clear to me how that was done.