If that's the case, then I have a bad feeling about the state of our industry. My experience with LLMs is that their code does _not_ cut it. Hallucinations are still a serious issue, and even when they aren't hallucinating, the models don't produce quality code: it's riddled with bugs, bad architecture, and poor decisions.
Writing good code with an LLM isn't any faster than writing good code without one, because the vast majority of an engineer's time isn't spent writing -- it's spent reading and thinking. Most of the time, you spend about as long with the LLM understanding its code, thinking through the problems, and verifying its work (and then reprompting it or redoing that work) as you would have spent just writing it yourself from scratch.
Which means that all these companies that are firing workers and demanding that their remaining employees use LLMs to increase productivity and throughput are going to find themselves, in a few years, with spaghettified, bug-riddled codebases that no one understands. And the competitors who _didn't_ jump on the AI bandwagon, but instead kept grinding away with a strong focus on quality, will eat their lunch.
Of course, there could be another unforeseen order-of-magnitude jump. There's always a chance of that, in which case my prediction would be invalid. But so far, what I see is a fast-approaching plateau.
SWE-bench went from ~30-40% to ~70-80% this year.
https://arxiv.org/html/2410.11840v1
Yeah, that makes this result a lot less impressive to me.
"Raising visibility on this note we added to address ARC "tuned" confusion:
> OpenAI shared they trained the o3 we tested on 75% of the Public Training set.
This is the explicit purpose of the training set. It is designed to expose a system to the core knowledge priors needed to beat the much harder eval set.
The idea is each training task shows you an isolated single prior. And the eval set requires you to recombine and abstract from those priors on the fly. Broadly, the eval tasks require utilizing 3-5 priors.
The eval sets are extremely resistant to just 'memorizing' the training set. This is why o3 is impressive." https://x.com/mikeknoop/status/1870583471892226343
So we might start seeing test flights actually entering orbit soon, possibly even carrying some real payloads.