If you try to train a DNN on a classical ML problem like the "Wine Quality" dataset from the UCI Machine Learning repo [0], you will get abysmal results: the dataset is only a few thousand rows, so a network of any real size just memorizes the training set and overfits.
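For a concrete illustration, here's a minimal sketch (assuming scikit-learn and pandas are installed, and that the UCI CSV is still served from its long-standing URL; the network size and split are arbitrary choices of mine). The point is the gap between train and test score:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score

# Red-wine subset of the UCI Wine Quality data: ~1.6k rows, 11 features.
url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "wine-quality/winequality-red.csv")
df = pd.read_csv(url, sep=";")
X, y = df.drop(columns="quality").values, df["quality"].values
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# A network that is large relative to ~1.3k training rows.
mlp = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(512, 512, 512), max_iter=2000, random_state=0),
).fit(X_tr, y_tr)

# A big gap between these two numbers is the overfitting in action.
print("train R^2:", r2_score(y_tr, mlp.predict(X_tr)))
print("test  R^2:", r2_score(y_te, mlp.predict(X_te)))
```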
The "magic" of LLMs comes from the training paradigm. Because the optimization target is next-word prediction, every word in the corpus is effectively a training example, so your sample size is the size of the corpus - an inconceivably vast number. Because you are training against a vast dataset, you can justify a proportionally immense model (e.g. 400B parameters) without overfitting. This vast (but justified) model complexity is what creates the amazing abilities of GPT/etc.
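To make the "every word is a training example" point concrete, here's a toy next-token-prediction setup in PyTorch (the tiny GRU "LM", vocab size, and random token batch are placeholders I made up; real LLMs use transformers and real text, but the loss has the same shape):

```python
import torch
import torch.nn as nn

vocab_size, d_model, seq_len, batch = 1000, 64, 32, 8

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)  # (batch, seq_len, vocab_size) logits

model = TinyLM()
tokens = torch.randint(0, vocab_size, (batch, seq_len))  # stand-in for a corpus chunk

logits = model(tokens[:, :-1])   # predict token t+1 from tokens up to t
targets = tokens[:, 1:]          # the "labels" are just the text, shifted by one
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)
loss.backward()  # one gradient step of "word prediction"

# Every position in every sequence is a supervised example for free,
# which is why the effective sample size is the size of the corpus.
print(batch * (seq_len - 1), "prediction targets in this one small batch")
```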
What wasn't obvious 10 years ago was the principle of "reusability" - the idea that the vastly complex model you trained under the LLM paradigm would have any practical value. Why is it useful to build an immensely sophisticated word prediction machine? Who cares about predicting words? The reason is that all the concepts the model had to learn in order to predict words can be reused for related NLP tasks.
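This is exactly the pretrain-then-reuse pattern you see everywhere today. A minimal sketch with the Hugging Face transformers library (assuming it's installed and can download its default checkpoint; the input sentence is just my example):

```python
from transformers import pipeline

# The default sentiment model is a language model that was pretrained on
# token prediction and then fine-tuned for sentiment classification.
classifier = pipeline("sentiment-analysis")
print(classifier("This wine is surprisingly good for the price."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```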
Zhang et al. (2021), 'Understanding deep learning (still) requires rethinking generalization'
If someone has something that explains this, I'd be grateful.