If so, does it suggest we could “just” build a Markov Chain using the original training data and get similar performance to the LLM?
> I implemented imperative code that does what I’m proposing the transformer is doing. It produces outputs very similar to the transformer.
This suggests there is probably a way to bypass transformers and get the same results. It would be interesting if that turned out to be more efficient, e.g. take a foundation model, train something simpler from it, and run it on a much smaller device.
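
To make the "just build a Markov Chain from the training data" idea concrete, here is a minimal sketch (not the code referenced in the quote above) of a word-level Markov chain built from a corpus; the toy corpus, the `order` parameter, and the function names are all made up for illustration:

```python
# Hypothetical sketch of a word-level Markov chain over a training corpus.
import random
from collections import defaultdict

def build_chain(corpus: str, order: int = 2) -> dict:
    """Map each tuple of `order` consecutive words to the words observed after it."""
    words = corpus.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chain[state].append(words[i + order])
    return chain

def generate(chain: dict, order: int = 2, length: int = 30) -> str:
    """Sample text by repeatedly picking a random observed successor of the last `order` words."""
    output = list(random.choice(list(chain.keys())))
    for _ in range(length):
        successors = chain.get(tuple(output[-order:]))
        if not successors:
            break
        output.append(random.choice(successors))
    return " ".join(output)

if __name__ == "__main__":
    corpus = "the cat sat on the mat and the cat ate the fish on the mat"
    chain = build_chain(corpus, order=2)
    print(generate(chain, order=2, length=15))
```

Whether something this simple could actually match an LLM is exactly the open question here: the state space explodes with context length, which is part of why transformers are used instead.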
https://en.wikipedia.org/wiki/Ouija