But under the hood, the image will have to be transformed into features / embeddings before it can be decoded into text. Suppose that the image gets processed into 100 “image tokens”, which are subsequently decoded into 1000 “text tokens”.
Now forget that we are even talking about images or OCR. If you look at just the decoding step, the 1000-token output was reconstructed from a representation 10x smaller than itself: the 100 image tokens effectively encode the same content in compressed form.
The implication for LLMs is that we don’t need 1000 tokens and 1000 token embeddings to produce the 1001st token, if we can figure out how to compress them into a 10x smaller latent representation first.
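To make that concrete, here is a minimal sketch of the idea: compress a 1000-embedding context into 100 latent vectors with learned-query cross-attention (Perceiver-style), then predict the next token from the latents alone. All names and hyperparameters here are hypothetical, and this is not any particular model's actual architecture; it just illustrates "decode from a compressed latent context".

```python
# Hypothetical sketch (PyTorch): squeeze 1000 token embeddings into 100
# latent vectors, then predict token 1001 from the latents instead of
# from the full context.
import torch
import torch.nn as nn

class LatentCompressor(nn.Module):
    def __init__(self, d_model=512, n_latents=100, n_heads=8):
        super().__init__()
        # 100 learned queries pull information out of the full context.
        self.latents = nn.Parameter(torch.randn(n_latents, d_model) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, context):           # context: (batch, 1000, d_model)
        q = self.latents.unsqueeze(0).expand(context.size(0), -1, -1)
        compressed, _ = self.cross_attn(q, context, context)
        return compressed                 # (batch, 100, d_model): 10x smaller

class NextTokenHead(nn.Module):
    """Predicts the 1001st token by attending only to the 100 latents."""
    def __init__(self, d_model=512, vocab=32000, n_heads=8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.proj = nn.Linear(d_model, vocab)

    def forward(self, latents):           # latents: (batch, 100, d_model)
        q = self.query.unsqueeze(0).expand(latents.size(0), -1, -1)
        out, _ = self.attn(q, latents, latents)
        return self.proj(out.squeeze(1))  # (batch, vocab) logits

if __name__ == "__main__":
    ctx = torch.randn(2, 1000, 512)       # stand-in for 1000 token embeddings
    latents = LatentCompressor()(ctx)     # -> (2, 100, 512)
    logits = NextTokenHead()(latents)     # -> (2, 32000)
    print(latents.shape, logits.shape)
```

Whether a 10x compression like this preserves enough information to decode from is exactly the empirical question; the OCR result suggests it can, at least for text-heavy content.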
Correct?
Any reasonably "normal" person (anyone who isn't severely autistic) will find that there are people they effortlessly connect with and many others they don't. It's the natural state.
Now, in any sufficiently intelligent and psychologically OK person, the act of eliciting or pushing an emotional connection with people from the latter group (where there's no natural connection) should trigger a certain amount of internal disgust.
The fact that it doesn't seem to be the case with the author would indicate that he's more of an outlier. Based on his writing he does seem intelligent and psychologically OK, so there might be other factors at play. My point is that his journey might not be transferable 1:1.