If the hypothesis is true, it makes sense to scale up models as much as possible during training - but once the model is sufficiently trained for the task, wouldn't 99% of the weights be literal "dead weight" - because they represent the "failed lottery tickets", i.e. the subnetworks that did not have the right starting values to learn anything useful? So why do we keep them around and waste enormous amounts of storage and compute on them?
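To make the "dead weight" question concrete, here is a minimal sketch of magnitude pruning, one common way of discarding weights after training. It assumes that the smallest-magnitude weights are the ones that matter least, which is a heuristic, not something the hypothesis guarantees. The function name `magnitude_prune`, the `keep_fraction` parameter, and the toy weights are all made up for illustration.

```python
def magnitude_prune(weights, keep_fraction):
    """Zero out all but the largest-magnitude weights.

    Illustrative helper only: assumes small-magnitude weights are the
    "dead weight" and can be dropped.
    """
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be between 0 and 1")
    n_keep = round(len(weights) * keep_fraction)
    # Indices of the n_keep weights with the largest absolute value.
    by_magnitude = sorted(
        range(len(weights)), key=lambda i: abs(weights[i]), reverse=True
    )
    keep = set(by_magnitude[:n_keep])
    # Keep the selected weights, replace everything else with zero.
    return [w if i in keep else 0.0 for i, w in enumerate(weights)]


# Toy "layer" with a few large weights and many near-zero ones.
weights = [0.9, -0.01, 0.02, -1.3, 0.005, 0.4, -0.03, 0.001, 0.07, -0.6]
pruned = magnitude_prune(weights, keep_fraction=0.3)
print(pruned)
```

Whether a real model still performs well after this kind of pruning is the empirical question; the sketch only shows the mechanics of throwing weights away.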
People bring problems to the LLM, the LLM produces some text, people use it and later return to iterate. This iteration functions as feedback on the LLM's earlier responses. If you judge an AI response by the next 20 or more rounds of interaction, you can gauge whether it was useful or not. Providers can create RLHF data this way, using hindsight or extra context from other related conversations by the same user on the same topic. That works because users try the LLM's ideas in reality and bring the results back to the model, or they simply recall from personal experience whether that approach would work. The system isn't just built to be right; it's built to be correctable by the user base, at scale.
OpenAI has 500M users; if each generates 1000 tokens per day, that means 0.5T interactive tokens/day. The chat logs dwarf the original training set in size and are very diverse, targeted to our interests, and mixed with feedback. They are also "on policy" for the LLM, meaning they contain corrections to mistakes the LLM made, not generic information like a web scrape.
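The back-of-the-envelope arithmetic can be checked in a few lines. Both inputs are the rough figures from the comment (500M users, 1000 tokens per user per day), not measured values, and the yearly total is simply extrapolated from them.

```python
# Rough estimate using the figures quoted above; both inputs are assumptions.
users = 500_000_000               # claimed number of users
tokens_per_user_per_day = 1_000   # assumed average per user

tokens_per_day = users * tokens_per_user_per_day
tokens_per_year = tokens_per_day * 365  # naive extrapolation, no growth

print(f"{tokens_per_day / 1e12:.1f}T tokens/day")
print(f"{tokens_per_year / 1e12:.1f}T tokens/year")
```

Under these assumptions that is 0.5T tokens per day, or roughly 180T tokens per year.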
You're right that LLMs might eventually not even need to crawl the web; they have the whole of society dumping data into their open mouths. That did not happen with web search engines; only social networks did that in the past. But social networks are filled with our culture wars and self-conscious posing, while the chat room is an environment where we don't need to signal our group alignment.
Web scraping gives you humanity's external productions - what we chose to publish. But conversational logs capture our thinking process, our mistakes, our iterative refinements. Google learned what we wanted to find, but LLMs learn how we think through problems.
It seems that LLMs, as they work today, make developers more productive. It is possible that they benefit less experienced developers even more than experienced developers.
More productivity, and perhaps very large multiples of productivity, will not be abandoned because of roadblocks constructed by those who oppose the technology for whatever reason.
Examples of the new productivity tool causing enormous harm (e.g. a bug that brings down some large service for a considerable amount of time) will not stop the technology if it brings considerable productivity.
Working with the technology and mitigating its weaknesses is the only rational path forward. And those mitigations can't be a set of rules that completely strip the new technology of its productivity gains. The mitigations have to work with the technology to increase its adoption or they will be worked around.
You seem to be claiming that this is a binary: either we will or won’t use LLMs. But the author is mostly talking about risk mitigation.
By analogy it seems like you’re saying the author is fundamentally against the development of the motor car because they’ve pointed out that some have exploded whereas before, we had horses which didn’t explode, and maybe we should work on making them explode less before we fire up the glue factories.
I’m glad that Trump has returned us to a world where quotes from the 5th century BC seem like commentary on current affairs, since it means that all my time learning about power dynamics in political systems during antiquity is now completely relevant to dealing with current events, rather than a giant waste of time.
For the uninitiated, data lakes are used to centralize vast quantities of data - often consumer data - usually by large organizations and governments to provide insights and inform decisions within the organization. Some call this surveillance capitalism.
Cloudera IPO'd somewhere around $2B and was taken private in a deal led by KKR for around $5B.