Readit News
icpmacdo commented on Claude 4   anthropic.com/news/claude... · Posted by u/meetpateltech
piperswe · 9 months ago
How much of that is because the models are optimizing specifically for SWE bench?
icpmacdo · 9 months ago
Not that much, because it's getting better across all benchmarks.
icpmacdo commented on Claude 4   anthropic.com/news/claude... · Posted by u/meetpateltech
dbingham · 9 months ago
It feels like these new models are no longer making order of magnitude jumps, but are instead into the long tail of incremental improvements. It seems like we might be close to maxing out what the current iteration of LLMs can accomplish and we're into the diminishing returns phase.

If that's the case, then I have a bad feeling for the state of our industry. My experience with LLMs is that their code does _not_ cut it. The hallucinations are still a serious issue, and even when they aren't hallucinating they do not generate quality code. Their code is riddled with bugs, bad architectures, and poor decisions.

Writing good code with an LLM isn't any faster than writing good code without it, since the vast majority of an engineer's time isn't spent writing -- it's spent reading and thinking. You have to spend more or less the same amount of time with the LLM understanding the code, thinking about the problems, and verifying its work (and then reprompting or redoing its work) as you would just writing it yourself from the beginning (most of the time).

Which means that all these companies that are firing workers and demanding their remaining employees use LLMs to increase their productivity and throughput are going to find themselves in a few years with spaghettified, bug-riddled codebases that no one understands. And competitors who _didn't_ jump on the AI bandwagon, but instead kept grinding with a strong focus on quality will eat their lunches.

Of course, there could be an unforeseen new order of magnitude jump. There's always the chance of that and then my prediction would be invalid. But so far, what I see is a fast approaching plateau.

icpmacdo · 9 months ago
"It feels like these new models are no longer making order of magnitude jumps, but are instead into the long tail of incremental improvements. It seems like we might be close to maxing out what the current iteration of LLMs can accomplish and we're into the diminishing returns phase."

SWE-bench went from ~30-40% to ~70-80% this year.

icpmacdo commented on One-Minute Video Generation with Test-Time Training   test-time-training.github... · Posted by u/hi
icpmacdo · 10 months ago
incredible results
icpmacdo commented on O1 isn't a chat model (and that's the point)   latent.space/p/o1-skill-i... · Posted by u/gmays
geor9e · a year ago
Instead of learning the latest workarounds for the kinks and quirks of a beta AI product, I'm going to wait 3 weeks for the advice to become completely obsolete
icpmacdo · a year ago
Modern AI both shortens the useful lifespan of software and increases the importance of development speed. Waiting around doesn’t seem optimal right now.
icpmacdo commented on GPT-5 is behind schedule   wsj.com/tech/ai/openai-gp... · Posted by u/owenthejumper
overgard · a year ago
By whom? He seems highly credible to me, and his credentials check out, especially compared to hype men like Sam Altman. All you're doing is spreading FUD from an unnamed "they".
icpmacdo · a year ago
He only criticizes AI capabilities, without creating anything himself. Credentials are effectively meaningless. With every new release, he clamors for attention to prove how right he was, and always will be. That's precisely why he lacks credibility.
icpmacdo commented on GPT-5 is behind schedule   wsj.com/tech/ai/openai-gp... · Posted by u/owenthejumper
overgard · a year ago
The article definitely has issues, but to me what's relevant is where it's published. The smart money and experts without a vested interest have been well aware for over a year that LLMs are an expensive dead end, and have been saying as much (Gary Marcus, for instance). That this is starting to enter mainstream consciousness is what's newsworthy.
icpmacdo · a year ago
Gary Marcus is continuously lambasted and not taken seriously
icpmacdo commented on GPT-5 is behind schedule   wsj.com/tech/ai/openai-gp... · Posted by u/owenthejumper
bloodyplonker22 · a year ago
I am working at an AI company that is not OpenAI. We have found ways to modularize training so we can test on narrower sets before training is "completely done". That said, I am sure there are plenty of ways others are innovating to solve the long training time problem.
icpmacdo · a year ago
This is literally just the scaling laws: "Scaling laws predict the loss of a target machine learning model by extrapolating from easier-to-train models with fewer parameters or smaller training sets. This provides an efficient way for practitioners and researchers alike to compare pretraining decisions involving optimizers, datasets, and model architectures."

https://arxiv.org/html/2410.11840v1#:~:text=Scaling%20laws%2....
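A rough sketch of that extrapolation idea, assuming a simple power-law form L(N) = a * N^(-b) + c; the functional form and the numbers below are illustrative stand-ins, not taken from the paper:

    # Fit a power law to losses measured on cheap-to-train models,
    # then extrapolate to predict the loss of a much larger target model.
    # All data points and the 70B target are made-up examples.
    import numpy as np
    from scipy.optimize import curve_fit

    # (parameter count, observed validation loss) for a few small models
    params = np.array([1e7, 3e7, 1e8, 3e8, 1e9])
    losses = np.array([3.9, 3.6, 3.3, 3.1, 2.9])

    def power_law(n, a, b, c):
        return a * n ** (-b) + c

    (a, b, c), _ = curve_fit(power_law, params, losses, p0=[10.0, 0.1, 2.0], maxfev=10000)

    # Extrapolate to a hypothetical 70B-parameter target model
    target_n = 7e10
    print(f"predicted loss at {target_n:.0e} params: {power_law(target_n, a, b, c):.2f}")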

icpmacdo commented on GPT-5 is behind schedule   wsj.com/tech/ai/openai-gp... · Posted by u/owenthejumper
arthurcolle · a year ago
Tokens don't need to be text either; you can move to higher-level "take_action" semantics, where something like "stream back 1 character to session#117" becomes a single function call. Training cheap models that can do things in the real world is going to change a huge amount of present capabilities over the next 10 years.
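A hypothetical sketch of how such an action vocabulary could be wired up (take_action, stream_char, and the session id are invented names for illustration, not from any real framework):

    # Instead of emitting raw text, the model emits structured actions
    # that a runtime dispatches as function calls.
    from dataclasses import dataclass
    from typing import Callable, Dict

    @dataclass
    class Action:
        name: str   # e.g. "stream_char"
        args: dict  # e.g. {"session": 117, "char": "a"}

    # Registry mapping action names to side-effecting handlers
    HANDLERS: Dict[str, Callable[..., None]] = {
        "stream_char": lambda session, char: print(f"session#{session} <- {char!r}"),
    }

    def take_action(action: Action) -> None:
        # Dispatch one decoded action "token" to its handler
        HANDLERS[action.name](**action.args)

    # A toy decoded model output: three action tokens instead of text tokens
    for act in (Action("stream_char", {"session": 117, "char": c}) for c in "hi!"):
        take_action(act)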
icpmacdo · a year ago
Can you share learning resources on this topic?
icpmacdo commented on OpenAI O3 breakthrough high score on ARC-AGI-PUB   arcprize.org/blog/oai-o3-... · Posted by u/maurycy
phil917 · a year ago
Lol I missed that even though it's literally the first sentence of the blog, good catch.

Yeah, that makes this result a lot less impressive for me.

icpmacdo · a year ago
ARC co-founder Mike Knoop

"Raising visibility on this note we added to address ARC "tuned" confusion:

> OpenAI shared they trained the o3 we tested on 75% of the Public Training set.

This is the explicit purpose of the training set. It is designed to expose a system to the core knowledge priors needed to beat the much harder eval set.

The idea is each training task shows you an isolated single prior. And the eval set requires you to recombine and abstract from those priors on the fly. Broadly, the eval tasks require utilizing 3-5 priors.

The eval sets are extremely resistant to just "memorizing" the training set. This is why o3 is impressive." https://x.com/mikeknoop/status/1870583471892226343

icpmacdo commented on SpaceX Super Heavy splashes down in the gulf, canceling chopsticks landing   twitter.com/spacex/status... · Posted by u/alach11
starspangled · a year ago
They demonstrated the engines re-lighting in space, which is significant. There had been some questions about this because the engine is of a design that is said to be very tricky to start, and the tank pressurization system of the rocket has the risk of water and CO2 ice forming in the methane tanks, which had caused several failures in past test flights. So this is a pretty good milestone.

So we might start seeing test flights actually entering orbit soon. Possibly even carrying some real payloads.

icpmacdo · a year ago
Why don't they already carry payloads? Is there anything worth taking up, given the current likelihood of it exploding, etc.?

u/icpmacdo

Karma: 2746 · Cake day: January 29, 2013