They're analyzing input embedding models, not LLMs. I'm not sure how the authors justify making claims about the inner workings of LLMs when they haven't actually computed a forward pass. The EMatrix is not an LLM, its a lookup table.
Just to highlight the ridiculousness of this research, no attention was computed! Not a single dot product between keys and queries. All of their conclusions are drawn from the output of an embedding lookup table.
The figure showing their alignment score correlated with model size is particularly egregious. Model size is meaningless when you never activate any model parameters. If Bert is outperforming Qwen and Gemma something is wrong with your methodology.
They used token embeddings directly and not intermediate representations because the latter depend on the specific sentence that the model is processing. Data on human judgment was however collected without any context surrounding each word, thus using the token embeddings seem to be the most fair comparison.
Otherwise, what sentence(s) would you have used to compute the intermediate representations? And how would you make sure that the results aren't biased by these sentences?
And like the other commenter said, you can absolutely feed single tokens through the model. Your point doesn’t make any sense though regardless. How about priming the model with “You’re a helpful assistant” just like everyone else does.