Exactly. Falcon-180b had a lot of hype at first but the community soon realized it was nearly worthless. Easily outperformed by smaller LLMs in the general case.
Now they are back and claiming their falcon-11b LLM outperforms Llama 3 8b. I already see a number of issues with this:
- falcon-11b is like 40% larger than Llama 3 8b so how can you compare them when they aren't in the same size class
- their claim seems to be based on automated benchmarks when it has long been clear that automated benchmarks are not enough to make that claim
- some of their automated benchmarks are wildly lower than Llama 3 8b's scores. It only beats Llama 3 8b on one benchmark, and just barely. I can make an LLM that does the best anyone has ever seen on one benchmark, but that doesn't mean my LLM is good. Far from it
- clickbait headline with knowingly premature claims because there has been zero human evaluation testing
- they claim their LLM is better than Llama 3 but completely ignore Llama 3 70b
Honestly, it annoys me how much attention tiiuae get when they haven't produced anything useful and continue this misleading clickbait.
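For what it's worth, the size-class point checks out on paper. A quick back-of-envelope sketch, taking the parameter counts as the nominal 11B and 8B (exact figures vary slightly by source):

```python
# Back-of-envelope size comparison; parameter counts are the approximate
# figures from the respective model cards, not exact values.
falcon_params = 11.1e9  # Falcon-11B
llama_params = 8.0e9    # Llama 3 8B

ratio = falcon_params / llama_params - 1
print(f"Falcon-11B is ~{ratio:.0%} larger than Llama 3 8B")  # roughly 39%

# At fp16 (2 bytes per parameter) the weight footprint alone differs too:
print(f"fp16 weights: {falcon_params * 2 / 1e9:.1f} GB vs {llama_params * 2 / 1e9:.1f} GB")
```

That weight-footprint gap is also why "same size class" is a stretch: the larger model needs meaningfully more VRAM to run locally at the same precision.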
Seems to be the case with all their models - really huge in size, no actual performance gains for the effort.
Their RefinedWeb dataset is heavily censored, so maybe that has something to do with it. It’s very morally conservative - total exclusion of pornography and other topics.
So I’d not be surprised if part of the problem is that they’re filtering out too much content and adding more of the same instead.
What? Falcon-7B base model is pretty much one of the only few small models that'll happily write a whole coherent fanfic all the way to the end without getting stuck in a loop right before the explicit content.
True, the model is bigger, but it required fewer tokens than Llama 3 to train. The issue is that without open datasets, it's hard to really compare and replicate. Is it because of the model's architecture? Dataset quality? Model size? A mixture of those? Something else?
> True, the model is bigger, but required less tokens than Llama 3 to train.
That…doesn’t matter to users. Users care about what it can do and what it requires for them to use it, not what it took for you to make it.
Sure, if it has better performance relative to training set size, that’s interesting from a scientific perspective and for learning how to train models, maybe, if it scales the same as other models in that regard. But ultimately, for use, until you get a model that does better absolutely, or does better relative to models with the same resource demands, you aren’t offering an advantage.
But... that modified Apache 2 license says the following:
"The Acceptable Use Policy may be updated from time to time. You should monitor the web address at which the Acceptable Use Policy is hosted to ensure that your use of the Work or any Derivative Work complies with the updated Acceptable Use Policy."
So no matter what you think of their current AUP they reserve the right to update it to anything they like in the future, and you'll have to abide by the new one!
Great example of why I don't like the trend of calling licenses like this "open source" when they aren't compatible with the OSI definition.
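If you took that monitoring obligation literally, the minimum viable compliance tool is just a change detector. A minimal sketch (the URL below is a placeholder, not the actual policy address, and KNOWN_HASH stands in for whatever digest you recorded at your last review):

```python
import hashlib
import urllib.request

# Placeholder URL -- substitute the address where the AUP is actually hosted.
AUP_URL = "https://example.com/acceptable-use-policy"
KNOWN_HASH = "<digest recorded at your last review>"

def aup_digest(url: str) -> str:
    """Fetch the policy page and return a SHA-256 digest of its bytes."""
    with urllib.request.urlopen(url) as resp:
        return hashlib.sha256(resp.read()).hexdigest()

# Usage: re-review the policy whenever the digest changes.
# if aup_digest(AUP_URL) != KNOWN_HASH:
#     print("AUP changed -- re-review before continuing to use the model")
```

Of course, needing a cron job just to know what license terms you're bound by is itself a good argument against calling this open source.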
> So no matter what you think of their current AUP they reserve the right to update it to anything they like in the future, and you'll have to abide by the new one!
I'm so curious whether this would actually hold up in court. Does anyone know if there's any case law / precedent around this?
Of course it's legal; projects change their licences all the time. There's a long history of startups that started with open source/open core and gradually closed off or commercialised the licence. This isn't anything new at all.
This is why it's good to read licenses before adopting the tech, especially if it's at all core to your business/project.
Not the first time they did some license shenanigans (happened with Falcon 1). I applaud their efforts but it seems they are still trying to figure out if/how to monetize.
I doubt the Emiratis have much interest in monetisation. The value they’re probably looking for is in LLMs as a media asset.
Just like Al Jazeera is valuable for Qataris and sports are valuable for the Saudis. These assets create goodwill domestically and with international stakeholders. Sometimes they make money, but that’s secondary.
If people spend a few hours a day talking to LLMs there’s some media value there.
They may also fear that Western models would be censored or licensed in a way harmful to UAE security and cultural objectives. Imagine if Llama 4’s license prevented military use without approval by some American agency.
When it is backed by the UAE the muscle you have to contend with is not simply legal muscle, it also includes armed muscle of questionable moral fibre (see support for the RSF).
Keep in mind that this is a comparison of base models, not chat tuned models, since Falcon-11B does not have a chat tuned model at this time. The chat tuning that Meta did seems better than the chat tuning on Gemma.
Regardless, the Gemma 1.1 chat models have been fairly good in my experience, even if I think the Llama3 8B chat model is definitely better.
CodeGemma 1.1 7B is especially underrated based on my testing against other relevant coding models. The CodeGemma 7B base model is one of the best models I’ve tested for code completion, and the chat model is one of the best I’ve tested for writing code. Some other models seem to game the benchmarks better, but in real-world use they don’t hold up as well as CodeGemma for me. I look forward to seeing how CodeLlama3 does, but it doesn't exist yet.
The model type is a good point. It's hard to track all the variables in this very fast paced field.
Thank you for sharing your CodeGemma experience. I haven't found an Emacs setup I'm satisfied with using a local LLM, but it will surely happen one day. Surely.
For me, CodeGemma is super slow - I'd say 3-4 times slower than Llama 3.
I am also looking forward to CodeLlama3, but I have a feeling Meta can't improve much on Llama 3 itself. Was there anything official from Meta?
Anecdotal, I know, but in my experience Gemma is absolutely worthless and Llama 3 8b is exceptionally good for its size. The idea that Gemma is ahead of Llama 3 is bizarre to me. Surely there’s some contamination or something if Gemma is showing up ahead in some benchmarks‽
Adding more anecdata, I know, but this has been exactly my experience as well. I haven't dug into details about the benchmarks, but just trying to use the things for basic question asking, Llama 3 is so much better it's like comparing my Milwaukee drill to my son's Fisher-Price plastic toy drill.
Sigh, I thought this was going to be about Spectrum Holobyte’s Falcon AT. From MyAbandonware.com:
> Essentially Falcon 2 but somehow marketed differently, Falcon AT is the second release in Spectrum Holobyte's revolutionary hard-core flight sim Falcon series. Despite popular belief that Falcon 3.0 was THE dawn of modern flight sims, Falcon AT actually is already a huge leap over Falcon, sporting sharp EGA graphics, and a lot of realistic options and greatly expanded campaigns. The game is still the simulation of modern air combat, complete with excellent tutorials, varied missions, and accurate flight dynamics that Falcon fans have come to know and love. Among its host of innovations is the amazingly playable multiplayer options -- including hotseat and over the modem. Largely forgotten now, Falcon AT serves to explain the otherwise inexplicable gap between Falcon and Falcon 3.0.
There seems to be a trend of people naming new things (perhaps unintentionally) after classic computer games. We just had a post here on a system called Loom which apparently isn't the classic adventure game. I'm half expecting someone to come up with an LLM or piece of networking software and name it Zork.
It doesn't help any that "F-16 Strike Eagle II reverse engineering" <https://news.ycombinator.com/item?id=40347662> is currently also on the front page, "priming" one to think similarly.
I welcome open models, although the Falcon model is not super open, as noted here. I will say that the original Falcon did not perform as well as its benchmark stats indicated -- it was pushed out as a significant leap forward, and I didn't find it outperformed competitive open models at release.
The PR stating an 11B model outperforms 7B and 8B models 'in the same class' feels like it might be stretching a bit. We'll see -- I'll definitely give this a go for local inference. But, my gut is that finetuned llama 3 8B is probably best in class...this week.
> I will say that the original Falcon did not perform as well as its benchmark stats indicated
Yea, I saw that as well. I believe it was undertrained in terms of tokens relative to parameters, because they really just wanted to have a 40bn-parameter model (i.e., pre-Chinchilla-optimal).
It's hard to know if there's any special sauce here, but the internet so far has decided "meh" on these models. I think it's an interesting choice to put it out as tech competitive. Stats say this one was trained on 5T tokens. For reference, Llama 3 so far was reported at 15T.
There is no way you get back what you lost in training by adding 3B more parameters.
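To put rough numbers on the undertraining point, here's a sketch using the Chinchilla heuristic of ~20 training tokens per parameter (a rule of thumb, not a law, and modern small models deliberately train far beyond it), with the reported 5T and 15T token counts:

```python
# (params, reported training tokens): 5T for Falcon-11B, 15T for Llama 3 8B.
models = {
    "Falcon-11B": (11e9, 5e12),
    "Llama 3 8B": (8e9, 15e12),
}

CHINCHILLA_TOKENS_PER_PARAM = 20  # rough compute-optimal heuristic

for name, (params, tokens) in models.items():
    tpp = tokens / params  # training tokens seen per parameter
    print(f"{name}: ~{tpp:.0f} tokens/param, "
          f"~{tpp / CHINCHILLA_TOKENS_PER_PARAM:.0f}x the Chinchilla heuristic")
```

Both are far past compute-optimal, but by this arithmetic Llama 3 8B saw roughly four times the data per parameter, which fits the intuition that 3B extra parameters don't buy back a 10T-token training gap.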
If I were in charge of UAE PR and this project, I'd
a) buy a lot more H100s and get the training budget up
b) compete on a regional / messaging / national freedom angle
c) fully open license it
I guess I'm saying I'd copy Zuck's plan, with oil money instead of social money and play to my base.
Overstating capabilities doesn't give you a lot of benefit out of a local market, unfortunately.
These reminders that AI will not only be wielded by democracies with (at least partial attempts at) ethical oversight, but also by the worst of the worst autocrats, are truly chilling.
MBZ (note MBZ is not MBS; Saudi Arabia and UAE are two different countries!) is one of the most popular leaders in the world and his people among the wealthiest. His country is one of the few developed countries in the world where the economy is still growing steadily, and one of the safest countries in the world outside of East Asia, in spite of having one of the world's most liberal immigration policies. Much more a contender for the best of the best autocrats than the worst of the worst.
I want to understand something: the model was trained mostly on a public dataset(?), with hardware from AWS, using well-known algorithms and techniques. How is it different from other models that anyone with the money could train?
My skeptic/hater(?) mentality sees this as only a "flex" and an effort to try to be seen as relevant. Is there more to this kind of effort that I'm not seeing?
A lot of models are in this category. Sovereignty (whether national or corporate) has some value. And the threat of competition is a good thing for everyone. I'm glad people are working on these even if the end result in most cases isn't anything particularly interesting.
https://huggingface.co/tiiuae/falcon-11B
https://huggingface.co/meta-llama/Meta-Llama-3-8B
https://mistral.ai/news/announcing-mistral-7b/
Ignore instruct tunes.
It's a modified Apache 2 license with extra clauses that include a requirement to abide by their acceptable use policy, hosted here: https://falconllm-staging.tii.ae/falcon-2-acceptable-use-pol...
Falcon always struck me as a regional prestige project more than a “how to monetise” play.
Falcon 1 is entirely obsolete at this point, based on every benchmark I've seen.
Thanks But No Thanks.
Open source was always a way to weasel about terms; that's why it's called open source and not free software.
Just because something is free doesn't mean you can change it or get the source.
Some claim something is open source when it's really just free of charge.
I was strongly under the impression that Llama 3 8B outperformed Gemma 7B on almost all metrics.
I don’t stay up on the benchmarks much these days though; I’ve fully dedicated myself to b-ball.
I’m actually a bit better than Lebron btw, who is nowhere near as good as my 3 year old daughter. I occasionally beat her. At basketball.
What do they mean by this? Isn't this roughly what GPT-4 Vision and LLaVA do?
Something like LLaVA but going language-to-vision, maybe? I can't steelman the idea in a way that makes sense, though.
Maybe they're just lying?