Exactly. Falcon-180b had a lot of hype at first but the community soon realized it was nearly worthless. Easily outperformed by smaller LLMs in the general case.
Now they are back and claiming their falcon-11b LLM outperforms Llama 3 8b. I already see a number of issues with this:
- falcon-11b is like 40% larger than Llama 3 8b so how can you compare them when they aren't in the same size class
- their claim seems to be based on automated benchmarks when it has long been clear that automated benchmarks are not enough to make that claim
- some of their automated benchmarks are wildly lower than Llama 3 8b's scores. It only beats Llama 3 8b on one benchmark, and just barely. I can make an LLM that does the best anyone has ever seen on one benchmark, but that doesn't mean my LLM is good. Far from it
- clickbait headline with knowingly premature claims because there has been zero human evaluation testing
- they claim their LLM is better than Llama 3 but completely ignore Llama 3 70b
Honestly, it annoys me how much attention tiiuae get when they haven't produced anything useful and continue this misleading clickbait.
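For what it's worth, the size-class point checks out on paper. A quick back-of-envelope sketch, taking the parameter counts as the nominal 11B and 8B (exact figures vary slightly by source):

```python
# Back-of-envelope size comparison; parameter counts are the approximate
# figures from the respective model cards, not exact values.
falcon_params = 11.1e9  # Falcon-11B
llama_params = 8.0e9    # Llama 3 8B

ratio = falcon_params / llama_params - 1
print(f"Falcon-11B is ~{ratio:.0%} larger than Llama 3 8B")  # roughly 39%

# At fp16 (2 bytes per parameter) the weight footprint alone differs too:
print(f"fp16 weights: {falcon_params * 2 / 1e9:.1f} GB vs {llama_params * 2 / 1e9:.1f} GB")
```

That weight-footprint gap is also why "same size class" is a stretch: the larger model needs meaningfully more VRAM to run locally at the same precision.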
Seems to be the case with all their models - really huge in size, no actual performance gains for the effort.
Their RefinedWeb dataset is heavily censored, so maybe that has something to do with it. It’s very morally conservative - total exclusion of pornography and other topics.
So I’d not be surprised if part of the problem is that they’re filtering out too much content and adding more of the same instead.
What? Falcon-7B base model is pretty much one of the only few small models that'll happily write a whole coherent fanfic all the way to the end without getting stuck in a loop right before the explicit content.
True, the model is bigger, but it required fewer tokens than Llama 3 to train. The issue is that without open datasets, it's hard to really compare and replicate. Is it because of the model's architecture? Dataset quality? Model size? A mixture of those? Something else?
> True, the model is bigger, but required less tokens than Llama 3 to train.
That…doesn’t matter to users. Users care about what it can do and what it requires for them to use it, not what it took for you to make it.
Sure, if it has better performance relative to training set size, that’s interesting from a scientific perspective and for learning how to train models, maybe, if it scales the same as other models in that regard. But ultimately, for use, until you get a model that does better absolutely, or does better relative to models with the same resource demands, you aren’t offering an advantage.
But... that modified Apache 2 license says the following:
"The Acceptable Use Policy may be updated from time to time. You should monitor the web address at which the Acceptable Use Policy is hosted to ensure that your use of the Work or any Derivative Work complies with the updated Acceptable Use Policy."
So no matter what you think of their current AUP they reserve the right to update it to anything they like in the future, and you'll have to abide by the new one!
Great example of why I don't like the trend of calling licenses like this "open source" when they aren't compatible with the OSI definition.
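If you took that monitoring obligation literally, the minimum viable compliance tool is just a change detector. A minimal sketch (the URL below is a placeholder, not the actual policy address, and KNOWN_HASH stands in for whatever digest you recorded at your last review):

```python
import hashlib
import urllib.request

# Placeholder URL -- substitute the address where the AUP is actually hosted.
AUP_URL = "https://example.com/acceptable-use-policy"
KNOWN_HASH = "<digest recorded at your last review>"

def aup_digest(url: str) -> str:
    """Fetch the policy page and return a SHA-256 digest of its bytes."""
    with urllib.request.urlopen(url) as resp:
        return hashlib.sha256(resp.read()).hexdigest()

# Usage: re-review the policy whenever the digest changes.
# if aup_digest(AUP_URL) != KNOWN_HASH:
#     print("AUP changed -- re-review before continuing to use the model")
```

Of course, needing a cron job just to know what license terms you're bound by is itself a good argument against calling this open source.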
> So no matter what you think of their current AUP they reserve the right to update it to anything they like in the future, and you'll have to abide by the new one!
I'm so curious whether this would actually hold up in court. Does anyone know if there's any case law / precedent around this?
Of course it's legal; projects change their licences all the time. There's a long history of startups that started with open source/open core and gradually closed off or commercialised the licence. This isn't anything new at all.
This is why it's good to read licenses before adopting the tech, especially if it's at all core to your business/project.
Not the first time they did some license shenanigans (happened with Falcon 1). I applaud their efforts but it seems they are still trying to figure out if/how to monetize.
I doubt the Emiratis have much interest in monetisation. The value they’re probably looking for is in LLMs as a media asset.
Just like Al Jazeera is valuable for Qataris and sports are valuable for the Saudis. These assets create goodwill domestically and with international stakeholders. Sometimes they make money, but that’s secondary.
If people spend a few hours a day talking to LLMs there’s some media value there.
They may also fear that Western models would be censored or licensed in a way harmful to UAE security and cultural objectives. Imagine if Llama 4’s license prevented military use without approval by some American agency.
When it is backed by the UAE the muscle you have to contend with is not simply legal muscle, it also includes armed muscle of questionable moral fibre (see support for the RSF).
Keep in mind that this is a comparison of base models, not chat tuned models, since Falcon-11B does not have a chat tuned model at this time. The chat tuning that Meta did seems better than the chat tuning on Gemma.
Regardless, the Gemma 1.1 chat models have been fairly good in my experience, even if I think the Llama3 8B chat model is definitely better.
CodeGemma 1.1 7B is especially underrated based on my testing against other relevant coding models. The CodeGemma 7B base model is one of the best models I’ve tested for code completion, and the chat model is one of the best I’ve tested for writing code. Some other models seem to game the benchmarks better, but in real-world use they don’t hold up as well as CodeGemma for me. I look forward to seeing how CodeLlama3 does, but it doesn't exist yet.
The model type is a good point. It's hard to track all the variables in this very fast paced field.
Thank you for sharing your CodeGemma experience. I haven't found an Emacs setup I'm satisfied with using a local LLM, but it will surely happen one day. Surely.
For me, CodeGemma is super slow - I'd say 3-4 times slower than Llama 3.
I am also looking forward to CodeLlama3, but I have a feeling Meta can't improve much on Llama 3 itself. Was there anything official from Meta?
Anecdotal, I know, but in my experience Gemma is absolutely worthless and Llama 3 8b is exceptionally good for its size. The idea that Gemma is ahead of Llama 3 is bizarre to me. Surely there’s some contamination or something if Gemma is showing up ahead in some benchmarks‽
Adding more anecdata, I know, but this has been exactly my experience as well. I haven't dug into details about the benchmarks, but just trying to use the things for basic question asking, Llama 3 is so much better it's like comparing my Milwaukee drill to my son's Fisher-Price plastic toy drill.
Sigh, I thought this was going to be about Spectrum Holobyte’s Falcon AT. From MyAbandonware.com:
> Essentially Falcon 2 but somehow marketed differently, Falcon AT is the second release in Spectrum Holobyte's revolutionary hard-core flight sim Falcon series. Despite popular belief that Falcon 3.0 was THE dawn of modern flight sims, Falcon AT actually is already a huge leap over Falcon, sporting sharp EGA graphics, and a lot of realistic options and greatly expanded campaigns. The game is still the simulation of modern air combat, complete with excellent tutorials, varied missions, and accurate flight dynamics that Falcon fans have come to know and love. Among its host of innovations is the amazingly playable multiplayer options -- including hotseat and over the modem. Largely forgotten now, Falcon AT serves to explain the otherwise inexplicable gap between Falcon and Falcon 3.0.
There seems to be a trend of people naming new things (perhaps unintentionally) after classic computer games. We just had a post here on a system called Loom which apparently isn't the classic adventure game. I'm half expecting someone to come up with an LLM or piece of networking software and name it Zork.
It doesn't help any that "F-16 Strike Eagle II reverse engineering" <https://news.ycombinator.com/item?id=40347662> is currently also on the front page, "priming" one to think similarly.
I welcome open models, although the Falcon model is not super open, as noted here. I will say that the original Falcon did not perform as well as its benchmark stats indicated -- it was pushed out as a significant leap forward, and I didn't find it outperformed competitive open models at release.
The PR stating an 11B model outperforms 7B and 8B models 'in the same class' feels like it might be stretching a bit. We'll see -- I'll definitely give this a go for local inference. But, my gut is that finetuned llama 3 8B is probably best in class...this week.
> I will say that the original Falcon did not perform as well as its benchmark stats indicated
Yea, I saw that as well. I believe it was undertrained in terms of tokens relative to parameters, because they really just wanted to have a 40bn-parameter model (i.e., pre-Chinchilla-optimal).
It's hard to know if there's any special sauce here, but the internet so far has decided "meh" on these models. I think it's an interesting choice to put it out as tech competitive. Stats say this one was trained on 5T tokens. For reference, Llama 3 so far was reported at 15T.
There is no way you get back what you lost in training by adding 3B more parameters.
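To put rough numbers on the undertraining point, here's a sketch using the Chinchilla heuristic of ~20 training tokens per parameter (a rule of thumb, not a law, and modern small models deliberately train far beyond it), with the reported 5T and 15T token counts:

```python
# (params, reported training tokens): 5T for Falcon-11B, 15T for Llama 3 8B.
models = {
    "Falcon-11B": (11e9, 5e12),
    "Llama 3 8B": (8e9, 15e12),
}

CHINCHILLA_TOKENS_PER_PARAM = 20  # rough compute-optimal heuristic

for name, (params, tokens) in models.items():
    tpp = tokens / params  # training tokens seen per parameter
    print(f"{name}: ~{tpp:.0f} tokens/param, "
          f"~{tpp / CHINCHILLA_TOKENS_PER_PARAM:.0f}x the Chinchilla heuristic")
```

Both are far past compute-optimal, but by this arithmetic Llama 3 8B saw roughly four times the data per parameter, which fits the intuition that 3B extra parameters don't buy back a 10T-token training gap.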
If I were in charge of UAE PR and this project, I'd
a) buy a lot more H100s and get the training budget up
b) compete on a regional / messaging / national freedom angle
c) fully open license it
I guess I'm saying I'd copy Zuck's plan, with oil money instead of social money and play to my base.
Overstating capabilities doesn't give you a lot of benefit out of a local market, unfortunately.
These reminders that AI will not only be wielded by democracies with (at least partial attempts at) ethical oversight, but also by the worst of the worst autocrats, are truly chilling.
MBZ (note MBZ is not MBS; Saudi Arabia and UAE are two different countries!) is one of the most popular leaders in the world and his people among the wealthiest. His country is one of the few developed countries in the world where the economy is still growing steadily, and one of the safest countries in the world outside of East Asia, in spite of having one of the world's most liberal immigration policies. Much more a contender for the best of the best autocrats than the worst of the worst.
I want to understand something: the model was trained mostly on a public dataset(?), with hardware from AWS, using well-known algorithms and techniques. How is it different from other models that anyone with the money could train?
My skeptic/hater(?) mentality sees this as only a "flex" and an effort to try to be seen as relevant. Is there more to this kind of effort that I'm not seeing?
A lot of models are in this category. Sovereignty (whether national or corporate) has some value. And the threat of competition is a good thing for everyone. I'm glad people are working on these even if the end result in most cases isn't anything particularly interesting.
https://huggingface.co/tiiuae/falcon-11B
https://huggingface.co/meta-llama/Meta-Llama-3-8B
https://mistral.ai/news/announcing-mistral-7b/
Ignore instruct tunes.
It's a modified Apache 2 license with extra clauses that include a requirement to abide by their acceptable use policy, hosted here: https://falconllm-staging.tii.ae/falcon-2-acceptable-use-pol...
Falcon always struck me as a regional prestige project more than a “how to monetise” play.
Falcon 1 is entirely obsolete at this point, based on every benchmark I've seen.
Thanks But No Thanks.
Open source was always a way to weasel about terms; that's why it's called open source and not free software.
Just because something is free doesn't mean you can change it or get the source.
Some claim something is open source when it's really just free of charge.
I was strongly under the impression that Llama 3 8B outperformed Gemma 7B on almost all metrics.
I don’t stay up on the benchmarks much these days though; I’ve fully dedicated myself to b-ball.
I’m actually a bit better than Lebron btw, who is nowhere near as good as my 3 year old daughter. I occasionally beat her. At basketball.
What do they mean by this? Isn't this roughly what GPT-4 Vision and LLaVA do?
Something like LLaVA but going language-to-vision, maybe? I can't steelman the idea in a way that makes sense, though.
Maybe they're just lying?