The trick (aside from the order) is the training process by which they are derived from their source data. Simply enumerating the states and transitions in the source data and the probability of each transition from each state in the source doesn’t get you an LLM.
You could probably point your code to Google Books N-grams (https://storage.googleapis.com/books/ngrams/books/datasetsv3...) and get something that sounds (somewhat) reasonable.
Strictly speaking, this is true of one particular way (the most straightforward) to derive a Markov chain from a body of text; a Markov chain is just a probabilistic model of state transitions in which the probability of each possible next state depends only on the current state. Having the states be word sequences of some fixed length, overlapping by all but one word, and setting the probabilities to the frequency with which the added word in the target state follows the sequence in the source state in the training corpus, is one way to derive a Markov chain from a body of text, but not the only one.
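For concreteness, the simplest case of that straightforward derivation might look like the following toy C sketch (single-word states, linear-search tables, a dozen-word in-memory corpus; nothing here is meant to scale): count how often each word follows the current one, then sample the next word in proportion to those counts.

  /* Toy sketch: build an order-1 word Markov chain from a tiny corpus by
   * simple frequency counting, then sample from it. Linear search keeps it
   * short; a real model over a large corpus needs better data structures. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <time.h>

  #define MAX_WORDS 1024
  #define MAX_EDGES 4096

  static char *words[MAX_WORDS];                 /* vocabulary: id -> string */
  static int nwords = 0;

  static struct { int from, to, count; } edges[MAX_EDGES];  /* transitions  */
  static int nedges = 0;

  static int word_id(const char *w)              /* intern a word, return id */
  {
      for (int i = 0; i < nwords; i++)
          if (strcmp(words[i], w) == 0) return i;
      words[nwords] = strdup(w);
      return nwords++;
  }

  static void add_transition(int from, int to)   /* count one observed pair  */
  {
      for (int i = 0; i < nedges; i++)
          if (edges[i].from == from && edges[i].to == to) { edges[i].count++; return; }
      edges[nedges].from = from; edges[nedges].to = to; edges[nedges].count = 1;
      nedges++;
  }

  static int next_state(int from)                /* sample next word by count */
  {
      int total = 0;
      for (int i = 0; i < nedges; i++)
          if (edges[i].from == from) total += edges[i].count;
      if (total == 0) return -1;                 /* dead end: no outgoing edge */
      int r = rand() % total;
      for (int i = 0; i < nedges; i++)
          if (edges[i].from == from && (r -= edges[i].count) < 0) return edges[i].to;
      return -1;
  }

  int main(void)
  {
      const char *corpus = "the cat sat on the mat the cat ate the fish";
      char buf[256]; int prev = -1;

      srand((unsigned)time(NULL));
      strcpy(buf, corpus);
      for (char *w = strtok(buf, " "); w; w = strtok(NULL, " ")) {
          int id = word_id(w);
          if (prev >= 0) add_transition(prev, id);
          prev = id;
      }

      int state = word_id("the");                /* start somewhere in the chain */
      for (int i = 0; i < 10 && state >= 0; i++) {
          printf("%s ", words[state]);
          state = next_state(state);
      }
      printf("\n");
      return 0;
  }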
I see this as a strength; try training an LLM on a 42KB text to see if it can produce a coherent output.
https://en.wikipedia.org/wiki/Natural_nuclear_fission_reacto...
"Please unblock challenges.cloudflare.com to proceed"
It looks like it's because of a Cloudflare Global Network issue
I will further improve my code and publish it on my GitHub account when I am satisfied with it.
It started as a Simple Language Model [0]; these differ from ordinary Markov generators by incorporating a crude prompt mechanism and a kind of very basic attention mechanism called history. My SLM uses Prediction by Partial Matching (PPM). The one in the link is character-based and very simple, but mine uses tokens and is 1,300 lines of C.
The tokenizer tracks the end of sentences and paragraphs.
I didn't use subword tokenization as LLMs do, but it would be trivial to incorporate. Tokens are represented by a number (again, as in LLMs), not a character string.
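Roughly, the tokenizer idea looks like this (a simplified sketch, not my actual code): split the text into word tokens, emit special end-of-sentence and end-of-paragraph tokens at boundaries, and intern every token as a number.

  /* Simplified tokenizer sketch: word tokens plus boundary tokens, each
   * represented by a numeric id from a small interning table. */
  #include <stdio.h>
  #include <string.h>
  #include <ctype.h>

  #define MAX_TOKENS 1024

  static char *vocab[MAX_TOKENS];
  static int vocab_size = 0;

  static int intern(const char *s)              /* string -> numeric token id */
  {
      for (int i = 0; i < vocab_size; i++)
          if (strcmp(vocab[i], s) == 0) return i;
      vocab[vocab_size] = strdup(s);
      return vocab_size++;
  }

  static void emit(const char *s)
  {
      printf("%3d  %s\n", intern(s), s);
  }

  int main(void)
  {
      const char *text = "The cat sat. It purred.\n\nA new paragraph began.";
      char word[64]; int len = 0;

      for (const char *p = text; ; p++) {
          if (isalnum((unsigned char)*p)) {      /* accumulate a word token   */
              if (len < (int)sizeof(word) - 1)
                  word[len++] = (char)tolower((unsigned char)*p);
              continue;
          }
          if (len) { word[len] = '\0'; emit(word); len = 0; }   /* flush word */
          if (*p == '.' || *p == '!' || *p == '?') emit("<end-of-sentence>");
          if (*p == '\n' && p[1] == '\n') emit("<end-of-paragraph>");
          if (*p == '\0') break;
      }
      return 0;
  }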
I use hash tables for the model.
There are several fallback mechanisms for when the next-state function fails. One of them uses the prompt; it is not demonstrated here.
Several other implemented mechanisms, like model pruning and skip-grams, are not demonstrated here either. I am trying to improve this Markov text generator, and tips in the comments would be a great help.
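To give an idea of how the hash-table model and a backoff fallback fit together, here is a simplified sketch (illustrative only, not my actual code; the toy corpus, the order-2 context, and the exact fallback chain are assumptions for the example): the model is a hash table keyed by a two-token context, and when that context has no recorded continuation, the generator falls back to the last token alone and finally to a random vocabulary token.

  /* Sketch: Markov model in a hash table keyed by a two-token context,
   * with a backoff fallback when the next-state lookup fails. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <time.h>

  #define MAX_VOCAB 512
  #define NBUCKETS  1024

  static char *vocab[MAX_VOCAB];
  static int vocab_size = 0;

  static int intern(const char *s)
  {
      for (int i = 0; i < vocab_size; i++)
          if (strcmp(vocab[i], s) == 0) return i;
      vocab[vocab_size] = strdup(s);
      return vocab_size++;
  }

  struct entry {                       /* one (context, next-token) count    */
      int c1, c2, next, count;         /* c1 == -1 marks an order-1 context  */
      struct entry *link;
  };
  static struct entry *buckets[NBUCKETS];

  static unsigned hash(int c1, int c2) { return (unsigned)(c1 * 31 + c2) % NBUCKETS; }

  static void count_transition(int c1, int c2, int next)
  {
      unsigned h = hash(c1, c2);
      for (struct entry *e = buckets[h]; e; e = e->link)
          if (e->c1 == c1 && e->c2 == c2 && e->next == next) { e->count++; return; }
      struct entry *e = malloc(sizeof *e);
      e->c1 = c1; e->c2 = c2; e->next = next; e->count = 1;
      e->link = buckets[h]; buckets[h] = e;
  }

  static int sample(int c1, int c2)    /* -1 if the context was never seen   */
  {
      unsigned h = hash(c1, c2);
      int total = 0;
      for (struct entry *e = buckets[h]; e; e = e->link)
          if (e->c1 == c1 && e->c2 == c2) total += e->count;
      if (total == 0) return -1;
      int r = rand() % total;
      for (struct entry *e = buckets[h]; e; e = e->link)
          if (e->c1 == c1 && e->c2 == c2 && (r -= e->count) < 0) return e->next;
      return -1;
  }

  static int next_token(int c1, int c2)  /* backoff: order 2, order 1, random */
  {
      int t = sample(c1, c2);
      if (t < 0) t = sample(-1, c2);
      if (t < 0) t = rand() % vocab_size;
      return t;
  }

  int main(void)
  {
      const char *corpus = "the cat sat on the mat and the cat ate the fish on the mat";
      char buf[256]; int w[64]; int n = 0;

      srand((unsigned)time(NULL));
      strcpy(buf, corpus);
      for (char *s = strtok(buf, " "); s; s = strtok(NULL, " "))
          w[n++] = intern(s);

      for (int i = 2; i < n; i++) {               /* train both model orders */
          count_transition(w[i - 2], w[i - 1], w[i]);
          count_transition(-1, w[i - 1], w[i]);
      }

      int c1 = w[0], c2 = w[1];                   /* generate a short sample */
      printf("%s %s", vocab[c1], vocab[c2]);
      for (int i = 0; i < 12; i++) {
          int t = next_token(c1, c2);
          printf(" %s", vocab[t]);
          c1 = c2; c2 = t;
      }
      printf("\n");
      return 0;
  }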
But my point is not to make an LLM; it's that LLMs produce good results not because of their supposedly advanced algorithms, but because of two things:
- There is an enormous amount of engineering in LLMs, whereas there is usually almost none in Markov text generators, so people get the impression that Markov text generators are toys.
- LLMs are possible because they build on impressive hardware improvements over the last decades. My text generator uses only 5 MB of RAM when running this example! But as commenters pointed out, the size of the model explodes quickly, and this is a point I should improve in my code.
And indeed, LLMs, even small ones like NanoGPT, are unable to produce results as good as my text generator's with only 42 KB of training text.
https://github.com/JPLeRouzic/Small-language-model-with-comp...