The Llama 4 launch looks like a real debacle for Meta. The model doesn't look great. All the coverage I've seen has been negative.
This is about what I expected, but it makes you wonder what they're going to do next. At this point it looks like they are falling behind the other open models, and their ambitious bet on MoEs hasn't paid off.
Did Zuck push for the release? I'm sure they knew it wasn't ready yet.
I don't know about Llama 4. Competition is intense in this field so you can't expect everybody to be number 1. However, I think the performance culture at Meta is counterproductive. Incentives are misaligned; I hope leadership will try to improve it.
Employees are encouraged to ship half-baked features and move to another project. Quality isn't rewarded at all. The recent layoffs have made things even worse. Skilled people were fired, slowing down teams. I assume the goal was to push remaining employees to work even more, but I doubt this is working.
I haven't worked in enough companies of this size to be able to tell if alternatives are better, but it's very clear to me that Meta doesn't get the best from their employees.
For those who haven't heard of it, "The Hawthorne Effect" is the name given to a phenomenon where, when a person or group being studied is aware they are being studied, their performance goes up by as much as 50% for 4-8 weeks, then regresses to its norm.
This is true if they are just being observed, or if some novel new processes are introduced. If the new things are beneficial, the performance rises for 4-8 weeks as usual, but when it regresses it regresses to a higher performance reflecting the value of the new process.
But when poor management introduces a counterproductive change, the Hawthorne Effect makes it look like a resounding success for 4-8 weeks. Then the effect fades, and performance drops below the original level. Sufficiently devious managers either move on to new projects or blame the workers for failing to maintain the new higher pace of performance.
This explains a lot of the incentive for certain types of leaders to champion arbitrary changes, take a victory lap, and then disassociate themselves from accountability for the long-term success or failure of their initiative.
(There is quite a bit of controversy over what the mechanisms for the Hawthorne Effect are, and whether change alone can trigger it or whether participants need to feel they are being observed, but the model as I see it fits my anecdotal experience, where new processes are always accompanied by attempts to meet new performance goals, and everyone is extremely aware that the outcome is being measured.)
> Move fast and break things
is really a bad concept in this space, where you get limited shots at releasing something that generates interest.
> Employees are encouraged to ship half-baked features
And this is why I never liked that motto and have always pushed back at startups where I was hired that embraced this line of thought.
Quality matters. It's context-dependent, so sometimes it matters a lot, and sometimes hardly. But "moving fast and breaking things" should be a deliberate choice, made for every feature, module, sprint, story all over again, IMO. If at all.
I agree. I think of it like a car engine. You can push it up to a certain RPM and it will keep making more and more power. Above that RPM, the engine starts to produce less power and eventually blows a gasket.
I think the performance-based management worked for a while because there were some gains to be had by pushing people harder. However, they’ve gone past that and are now pushing people too hard and getting worse results. Every machine has its operating limits and an area where it operates most efficiently. A company is no different.
You're not encouraged per se to ship half-baked features, but if you don't have enough "impact" at the end of the half (for the mid-cycle check-in) or the year (for the full PSC cycle), then you're going to get "Below Expectations" and then "Meets Most" (or worse), and with the current environment, a swift offboarding.
When I was there (working in integrity), our group of staff+ engineers opined about how it led to perverse incentives - and whilst you can work there, do great work, and get good ratings, I saw too many examples of "optimizing for PSC" (otherwise known as PSC hacking).
It's also terrible output, even before you consider what looks like catastrophic forgetting from crappy RL. The emoji use and writing style make me want to suck-start a revolver. I don't know how they expect anyone to actually use it.
> Employees are encouraged to ship half-baked features and move to another project
Maybe there is more to that. It's been more than a year since Llama 3 was released. That should be enough time for Meta to release something significantly improved. Or do you mean that, quarter by quarter, the engineers had to show they were making impact for their perf reviews, which could be detrimental to the Llama 4 project?
Another thing that puzzles me is that again and again we see that the quality of a model can improve with more high-quality data, so why can't Meta manage to secure a massive amount of new high-quality data to boost their model's performance?
Yep they've basically created a culture where people are incentivized to look busy, ship things fast, and look out for themselves. Which attracts and retains people that thrive in that kind of environment.
That's a terrible way to execute on big, ambitious projects since it discourages risky bets and discourages collaboration.
It's not a big deal. Llama 4 feels like a flop because the expectations are really high based on their previous releases and the sense of momentum in the ecosystem because of DeepSeek. At the end of the day, Llama 4 didn't meet the elevated expectations, but they're fine. They'll continue to improve and iterate, and maybe the next one will be more hype-worthy, or maybe expectations will be readjusted as the specter of diminishing returns continues to creep in.
It feels like a flop because it is objectively worse than models many times smaller that shipped some time ago. In fact, it is worse than earlier LLaMA releases on many points. It's so bad that people who first tried it assumed that the downloaded weights must be corrupted somehow.
The switching costs are so low (zero) that anyone using these models just jumps to the best performer. I also agree that this is not a brand or narrative sensitive project.
> it makes you wonder what they're going to do next
They're just gonna keep throwing money at it. This is a hobby and talent magnet for them; Instagram is the money printer. They've been working on VR for like a decade with barely any results in terms of users (compared to costs). This will be no different.
Both are also decent long-term bets. Being the VR market leader now means they will be the VR market leader, with plenty of in-house talent and IP, when the technology matures and the market grows. Being in the AI race, even if they are not leading, means they have in-house talent and technology to be able to react to wherever the market is going with AI. They have one of the biggest messengers and one of the biggest image-posting sites, so there is a decent chance AI will become important to them in some not-yet-obvious way.
One of Meta's biggest strengths is Zuckerberg being able to play these kinds of bets. Those bets being great for PR and talent acquisition is the cherry on top.
I remember reading that they were in panic mode when the DeepSeek model came out, so they must have scrambled and had to rework a lot of things, since DeepSeek was so competitive and open source as well.
Fear of R2 looms large as well. I suspect they succumbed to the nuance collapse along the lines of “Is double checking results worth it if DeepSeek eats our lunch?”
Do you know that they made a bet on MoE? Meaning they abandoned dense models? I doubt that is the case. Just releasing an MoE Llama 4 does not constitute a "bet" without further information.
Also, from what I can tell, this performs better than models with parameter counts equal to one expert, and worse than fully dense models equal to the total parameter count. Isn't that kind of what we'd expect? In what way is that a failure?
Maybe I am missing some details. But it feels like you have an axe to grind.
A 4x8 MOE performs better than an 8B but worse than a 32B, is your statement?
My response would be, "so why bother with MOE?"
However, DeepSeek R1 is MOE from my understanding, but the "E" are all >=32B parameters. There are >20 experts. I could be misinformed; however, even so, I'd say a MOE with 32B or even 70B experts will outperform (define this!) models with equal parameter counts, because DeepSeek outperforms (define?) ChatGPT et al.
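For what it's worth, the usual argument for bothering with MoE is that you buy extra total capacity without paying for it per token. A rough back-of-the-envelope sketch in Python, with sizes invented purely for illustration (not Llama 4's or DeepSeek's actual configurations):

    # Rough MoE parameter arithmetic with invented sizes.
    # Total parameters grow with the number of experts, but per-token
    # compute only scales with the experts actually routed to.

    def moe_params(shared_b, expert_b, n_experts, n_active):
        """Return (total, active) parameter counts in billions."""
        total = shared_b + expert_b * n_experts
        active = shared_b + expert_b * n_active
        return total, active

    # Hypothetical config: 8 experts of 16B each, 2 routed per token,
    # plus ~20B of shared attention/embedding weights.
    total, active = moe_params(shared_b=20, expert_b=16, n_experts=8, n_active=2)
    print(f"total: {total}B, active per token: {active}B")
    # -> total: 148B, active per token: 52B

So the rough expectation is quality somewhere between a 52B dense model and a 148B dense model, at roughly the inference cost of the 52B one - which is consistent with the "better than one expert, worse than the dense total" observation above.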
I'm just shocked that the companies who stole all kinds of copyrighted material would again do something unethical to keep the bubble and gravy train going...
Yes, their worst fear is people figuring out that an AI chatbot is a strict librarian that spits out quotes but doesn't let you enter the library (the AI model itself). Because with 3D game-like UIs, people can enter the library and see all their stolen personal photos (if they were ever online), all kinds of monsters. It'll be all over YouTube.
I think it's most illustrative to see the sample battles (H2H) that LMArena released [1]. The outputs of Meta's model are too verbose and too 'yappy' IMO. And looking at the verdicts, it's no wonder that people are discounting LMArena rankings.
People have been gaming ML benchmarks as long as there have been ML benchmarks. That's why it's better to see if other researchers are incorporating a technique into their actual models rather than 'is this paper the bold entry in a benchmark table'. But it takes longer.
“Got caught” is a misleading way to present what happened.
According to the article, Meta publicly stated, right below the benchmark comparison, that the version of Llama on LMArena was the experimental chat version:
> According to Meta’s own materials, it deployed an “experimental chat version” of Maverick to LMArena that was specifically “optimized for conversationality”
The AI benchmark in question, LMArena, compares Llama 4 experimental to closed models like ChatGPT 4o latest, and Llama performs better (https://lmarena.ai/?leaderboard).
There are almost certainly ways to fine-tune the model in ways that make it perform better on the Arena, but perform worse in other benchmarks or in practice. Usually that's not a good trade-off. What's being suggested here is that Meta is running such a fine-tuned version on the Arena (and reporting those numbers) while running models with different fine-tuning on other benchmarks (and reporting those numbers), while giving the appearance that those are actually the same models.
It can be easily gamed. The users are self-selected, and they have zero incentive to be honest or rigorous or provide good responses. Some have incentives the opposite way. (There was a report of a prediction market user who said they had won a market on Gemini models by manipulating the votes; LMArena swore furiously there had definitely been no manipulation but was conspicuously silent on any details.) And the release of more LMArena responses has shown that a lot of the user ratings are blatantly wrong: either they're basically fraudulent, or LMArena's current users are people whose ratings you should be optimizing against because they are so ignorant, lazy, and superficial.
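To put a rough scale on the manipulation concern, here is a toy simulation using the classic Elo update. It is only an illustrative sketch: the K-factor, starting ratings, and vote count are invented, and LMArena's actual rating pipeline (as I understand it, a Bradley-Terry-style fit over all battles) differs from this online update.

    # Toy Elo arithmetic: how far a block of one-sided votes can push a
    # rating. K-factor and vote count are invented for illustration.

    def elo_update(r_a, r_b, score_a, k=4.0):
        # One head-to-head battle; score_a is 1.0 if A wins, 0.0 if B wins.
        expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
        delta = k * (score_a - expected_a)
        return r_a + delta, r_b - delta

    model_a, model_b = 1300.0, 1300.0   # two models rated as equals
    for _ in range(200):                # 200 coordinated "A wins" votes
        model_a, model_b = elo_update(model_a, model_b, score_a=1.0)

    print(round(model_a), round(model_b))
    # The resulting gap is large compared to the few dozen points that
    # typically separate the top entries on the leaderboard.

Deduplication, anomaly detection, and vote weighting would blunt this, but it illustrates why a self-selected, unverified voter pool is a structural weakness rather than a quality nitpick.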
At this point, when I look at the output from my Gemini-2.5-pro sessions, they are so high quality, and take so long to read, and check, and have an informed opinion on, I just can't trust the slapdash approach of LMArena in assuming that careless driveby maybe-didn't-even-read-the-responses-ain't-no-one-got-time-for-that-nerd-shit ratings mean much of anything. There have been red flags in the past and I've been taking them ever less seriously even as one of many benchmarks since early last year, but maybe this is the biggest backfire yet. And it's only going to get worse. At this rate, without major overhaul, you should take being #1 on LMArena seriously as useful and important news - as a reason to not use a model.
It's past time for LMArena people to sit down and have some thorough reflection on whether it is still worth running at all, and at what point they are doing more harm than good. No benchmark lives forever and it is normal and healthy to shut them down at some point after having been saturated, but some manage to live a lot longer than they should have...
I guess I can't really refute your experience or observations. But, just a single anecdotal point; I use the arena's voting feature quite a bit and I try really hard to vote on the "best" answer. I've got no clue if the majority of the voters put the same level of effort into it or not, but I figure that an honest and rigorous vote is the least I can do in return for a service provided to me free with no obnoxious ads. There's a nonzero incentive to do the right thing, but it's hard to say where it comes from.
As an aside, I like getting two responses from two models that I can compare against one another (and with the primary sources of truth that I know a priori). Not only does that help me sanity-check the responses somewhat, but I get to interact with new models that I wouldn't have otherwise had the opportunity to. Learning new stuff is good, and being earnest is good.
LMArena was always junk. I work in this space and while the media takes it seriously most scientists don't.
Random people ask random stuff and then it measures how good they feel. This is only a worthwhile evaluation if you're Google or Meta or OpenAI and you need to make a chatbot that keeps people coming back. It doesn't measure anything else useful.
I hear AI news from time to time from the M5M in the US - and the only place I've ever seen "LMArena" is on HN and in the LM studio discord. At a ratio of 5:1 at least.
Conversation is a two-way street. A good conversation mechanic could elicit better interaction from the users and result in better answers. Stands to reason, anyway.
In one of Karpathy's videos he said that he was a bit suspicious that the models that score the highest on LMArena aren't the ones that people use the most to solve actual day-to-day problems.
Ahmad al-Dahle, who leads "Gen AI" at Meta, wrote this on Twitter:
> ... We're also hearing some reports of mixed quality across different services ...
> We've also heard claims that we trained on test sets -- that's simply not true and we would never do that. Our best understanding is that the variable quality people are seeing is due to needing to stabilize implementations.
> We believe the Llama 4 models are a significant advancement and we're looking forward to working with the community to unlock their value.
There seems to be a lot of hullabaloo, accusations, and rumors, but little meat to any of them. Maybe they rushed the release, were unsure of which one to go with, and did some moderate rule bending in terms of which tune got sent to the arena, but I have seen no hard evidence of real underhandedness.
I believe this was designed to flatter the prompter more / be more ingratiating. Which is a worry if true (what it says about the people doing the comparing).
All I know is it is the first Llama release since Zuck brought "masculine energy" back to Meta.
Imagine this but you remove the noise and can walk like in an art gallery (it's a diffusion model but LLMs can be loosely converted into 3D maps with objects, too): https://writings.stephenwolfram.com/2023/07/generative-ai-sp...
[1]: https://huggingface.co/spaces/lmarena-ai/Llama-4-Maverick-03...
I thought there was an aspect where you run two models on the same user-supplied query. Surely this can't be gamed?
> “optimized for conversationality”
I don't understand what that means - how it gives it an LMArena advantage.