adzm · 2 months ago
For those curious, the 24 official languages are Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, and Swedish.

Maltese, interestingly, is the only Afro-Asiatic derived language.

Hungarian, Finnish, and Estonian are the three Uralic languages.

All the others are Indo-European, Greek being the only Hellenic one, Irish the only Celtic, the rest are Baltic, Slavic, Italic, or Germanic.

(I originally used the term Balto-Slavic, though I was unaware of some of the connotations of that term until just now. Baltic and Slavic do share a common origin, but that was a very very long time ago)

arbuge · 2 months ago
> Maltese, interestingly, is the only Afro-Asiatic derived language.

It's Semitic, to be precise.

https://en.wikipedia.org/wiki/Semitic_languages

UebVar · 2 months ago
Arabic, even. An outlier, as it is AFAIK the only Arabic dialect that is not written with the Arabic alphabet. Also, it's far removed from other Arabic dialects.
Vinnl · 2 months ago
Tomorrow there are elections in the Netherlands, and two parties are proposing adding Frisian to that list: https://neerlandistiek.nl/2025/10/kies-voor-taal/

Best get to retraining those models.

tecleandor · 2 months ago
AFAIK, they are trying to get Frisian added to the "European Charter for Regional or Minority Languages", not the official language list.

They get certain recognition, but they are not official in Europe. For example, just from Spain there are 13 languages on that list.

mikrl · 2 months ago
As a Brit I feel very at home when hearing/reading Dutch and Frisian. It’s a reminder that England and the Low Countries share a lot of close history all the way back to Anglo-Saxon times; of being fishers, traders, burghers and mercenaries moving around the North Sea chasing opportunities, spreading and augmenting languages.

“Brea, bûter en griene tsiis is goed Ingelsk en goed Frysk”

przemub · 2 months ago
Each EU country nominates one official language for the EU, otherwise we'd have Catalan, Breton, Kashubian and many more.
sigmar · 2 months ago
Should be noted- the Netherlands can't unilaterally make changes. Spain has been trying to push for languages to be added and hasn't had luck.
rzwitserloot · 2 months ago
Not sure what happened there but your link disproves your statement.

Specifically, the link says two things:

1. That 2 parties want to add *Limburgish* to the list, not Frisian. That's the bottom-right part of the Netherlands, about as far removed from Friesland as you can get (which is the top part of the Netherlands).

2. That one party wants to add Frisian, but that is a one-day-fly party (a flash in the pan) that will cease to exist in a few hours, as they will get 0 seats in this election and will presumably call it a day right after. It was a party founded to support one person, and that person has quit due to work stress and is highly unlikely to return, as this _was_ his return. Their opinion used to be relevant, as they had 13.3% of the seats this past session (and didn't exist before it). But it isn't now.

ginko · 2 months ago
Just do a 50:50 mix of the German and Dutch model weights.
purrcat259 · 2 months ago
I read, write and speak Maltese, AMA if you are curious about the language.
franklin_p_dyer · 2 months ago
Not a question, but - Tatoeba could use your help! It is an open source (both code and data) dataset of parallel sentences and their Maltese data is very lacking. Also it’s pretty fun to just translate a bunch of random sentences into a language you speak. :-)

https://tatoeba.org/

Raed667 · 2 months ago
Tunisians claim they can understand Maltese with minimal effort; is it reciprocal? How close is Maltese to Arabic / the Tunisian dialect?
barrell · 2 months ago
I recently discovered Maltese existed, and started learning it that day. I find it such an awesome language, and not just because of the letter Ħ

I do wonder what natives think and feel about the longevity of their language? What is taught in schools at what ages (assuming English is in the mix somewhere)? Is there enough media in Maltese for the Maltese to go about modern life fully in Maltese? It’s shockingly hard to find any information on Maltese, and even harder to find content.

I’m not sure if it’s dying out, or in danger thereof; if there are preservation efforts, or if there is no need.

nxor · 2 months ago
How are loan words viewed? Do businesses work in Maltese? Are monolingual speakers of the language regarded differently than those fluent in English? Do young people in Malta listen to Maltese music?
adzm · 2 months ago
I'm actually really curious about everyday usage of the language; is code switching between English and Maltese more common than Maltese on its own? I've seen a few online communities where the vocabulary switches between Maltese and English very often which is interesting but I wonder how much of that is just online / written versus everyday speech.
ebb_earl_co · 2 months ago
What is the name of Maltese in Maltese? Like “el español” in Spanish, it’s neat to know what languages call themselves
Tade0 · 2 months ago
How is "Marsaxlokk" really pronounced? I've heard that word a few times, but never from a native. Google translate can't help me here, as it doesn't seem to have Maltese text-to-speech.
runarberg · 2 months ago
Is there any dialect of Arabic which you can understand without too much effort?

How much do you consider Maltese its own language (as opposed to a dialect of Arabic)?

cm2012 · 2 months ago
Can you communicate with Maltese dogs more effectively?
jim180 · 2 months ago
Lithuanian and Latvian are Baltic languages. Nothing to do with Slavic...
adzm · 2 months ago
I was thinking about separating the two groups when I was writing this but was afraid of getting too verbose, though in retrospect that probably would have made more sense regardless of the historical lineage. My apologies if this came off as inconsiderate.

I updated my original comment, and learned a good amount about that dispute as a result, so thanks for calling it out.

kaato137 · 2 months ago
Balto-Slavic branch divides into Baltic and Slavic language groups so nothing wrong here
cyfex · 2 months ago
> Greek being the only Hellenic one

Are there really any other Hellenic languages besides Greek?

skissane · 2 months ago
Cappadocian Greek is Greek heavily mixed with Turkish, to the extent that it is arguably better viewed as a distinct Hellenic language rather than just a nonstandard Greek dialect. However, around a century ago, most Greek speakers were expelled from Turkey and deported to Greece (and the same happened in reverse: most Turkish speakers in Greece were deported to Turkey), including almost all Cappadocian speakers. They and their descendants largely switched to standard modern Greek, with the result that it was long believed that Cappadocian had died out in the 1960s, although more recently it has been discovered that there remain small populations of Cappadocians in rural Greece keeping the language alive.
sva_ · 2 months ago
Seems like the model isn't limited to those though, from the paper:

> as well as some additional relevant languages (Arabic, Catalan, Chinese, Galician, Hindi, Japanese, Korean, Norwegian, Russian, Turkish, and Ukrainian).

https://arxiv.org/pdf/2409.16235

The paper also goes into detail on training set sources; I feel the curation thereof might be considered the main contribution of this publication?

ChrisMarshallNY · 2 months ago
Flemish? I remember watching a TV show in Flemish (Hotel Beau Séjour[0]), so it's prevalent enough to invest that kind of money into.

What about Basque? Is that too controversial?

[0] https://en.wikipedia.org/wiki/Hotel_Beau_Séjour

yvdriess · 2 months ago
Flemish is more of a political construct than a linguistic one: it's a grouping of the Belgian Dutch coastal, Brabant and Limburg language groups, each having their own regional dialects.
mytailorisrich · 2 months ago
I think those 24 languages reflect all the languages that are official languages at country level.

So for instance, Basque is not an official language of any country (only French in France and Spanish/Castilian in Spain). Belgium's official languages are French, Dutch, and German; "Flemish" is only a local variant of Dutch (Belgian French is also only a local variant of French).

tirant · 2 months ago
Basque is not controversial, but it is spoken by very few people.
td540 · 2 months ago
Like British English vs US English, Flemish is a dialect of Dutch.
amarant · 2 months ago
I find it interesting that Norwegian isn't on the list.

I have often joked that Norwegian is just a dialect of Swedish, but I never expected to get official validation like this!

rcbdev · 2 months ago
Norwegian is not on this list, because in fact no country with Norwegian as their national language is part of the European Union at the time of writing.
2000UltraDeluxe · 2 months ago
"Norwegian" isn't just one unified language and Norway isn't in the EU.

That being said, the Scandinavian languages all come from Old Norse, and modern national constructs aside, most of the people in those areas descend from the same mix of Germanic tribes. There's no denying that modern-day Danish, Norwegian and Swedish are very similar.

emil-lp · 2 months ago
Norway isn't in the EU, though.
bdhtu · 2 months ago
Norway isn't in the EU.

_kidlike · 2 months ago
In Greek we call our language Hellenic, and our country Hellas. "Greek" / "Greece" don't exist in the Hellenic language.
ranadomo · 2 months ago
> Γραικοί, Graikoí were an ancient Hellenic tribe

https://en.wikipedia.org/wiki/Graecians

3836293648 · 2 months ago
Yes it does: it was a Greek colony off the southern coast of Italy, which was the primary Greek connection to the Romans, which is how the name stuck.
fsckboy · 2 months ago
Is Ireland the only country to bring in two languages, Irish/Gaelic and English? Is English an official language of any other EU countries?
layer8 · 2 months ago
English is an official EU language because Regulation 1 Article 1 says so [0] and hasn’t been changed. In practice, English is the most widely used language in EU institutions, so it would have been silly to remove it after Brexit.

[0] https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:01...

JAlexoid · 2 months ago
I believe Malta has English as an official language.

PS: Gaelic is a more general term covering Irish and Scottish Gaelic. Ireland brings specifically the Irish language (Gaeilge in Irish).

rags2riches · 2 months ago
Malta has Maltese and English as official languages. I don't know what they bring to the EU list of official languages.
ginko · 2 months ago
AFAIK Ireland only listed Gaelic as their official language with UK having English. That caused a bit of a problem during Brexit since technically English wasn't officially an EU language anymore. I guess they resolved it somehow.
rat87 · 2 months ago
Why Italic as opposed to Romance/Latin? I don't think there are any surviving non-Latin branches of the Italic family, are there?
ks2048 · 2 months ago
From other comments, it seems many people don't realize that there are 11 more languages than these 24 official (this is mentioned in the paper):

Arabic, Catalan, Chinese, Galician, Hindi, Japanese, Korean, Norwegian, Russian, Turkish, and Ukrainian.

jll29 · 2 months ago
+1
zhengiszen · 2 months ago
Maltese is derived from dialectal Arabic
Qem · 2 months ago
What about Basque, is it not included?
ranc1d · 2 months ago
Basque doesn't seem to be but Catalan, Galician are

https://huggingface.co/utter-project/EuroLLM-9B

jenadine · 2 months ago
No Luxembourgish?
punnerud · 2 months ago
Norwegian is also included, based on the model card: https://huggingface.co/utter-project/EuroLLM-9B
unscaled · 2 months ago
Considering there are two different official written forms of Norwegian, that's not really saying enough, but I guess they mean Bokmål.
threesmegiste · 2 months ago
Turkish?
runarberg · 2 months ago
Is official in Northern Cyprus. But as I understand it while the whole island of Cyprus is in the EU, the state of Northern Cyprus isn’t.
Stagnant · 2 months ago
Title is missing "(2024)". The 9B model was released last December[0].

0: https://sites.google.com/view/eurollm/home

htrp · 2 months ago
>The EuroLLM Team brings together some of the brightest minds in AI including Unbabel, Instituto Tecnico Lisbon, the University of Edinburgh, Instituto de Telecommunicacoes, Université Paris-Saclay, Aveni, Sorbonne University, Naver Labs, and the University of Amsterdam.

>Europe is the only continent in the world to have a large public network of supercomputers that are managed by the EuroHPC Joint Undertaking (EuroHPC JU). As soon as we received the EuroHPC JU access to the supercomputer, we were ready to roll up our sleeves and get to work. We developed the small model right away and in less than 6 months the second model was ready.

[1] https://www.eurohpc-ju.europa.eu/eurohpc-success-story-speak...

Repurposing some of that physics sim compute

biohazard2 · 2 months ago
>Europe is the only continent in the world to have a large public network of supercomputers that are managed by the EuroHPC Joint Undertaking (EuroHPC JU).

Who would have thought that Europe is the only continent to have a network of supercomputers managed by Europe⸮

blitzar · 2 months ago
This is the extent of the moat.
loandbehold · 2 months ago
Aren't all frontier models already able to use all these languages? Support for specific languages doesn't need to be built in, LLMs support all languages because they are trained on multilingual data.
melvinmelih · 2 months ago
> because they are trained on multilingual data

But they were not trained on government-sanctioned homegrown EU data.

sunaookami · 2 months ago
Who in their right mind would use this?
saretup · 2 months ago
The entirety of the internet vs government-sanctioned homegrown EU data.
tonyhart7 · 2 months ago
"But they were not trained on government-sanctioned homegrown EU data."

ok what are you implying on this

raverbashing · 2 months ago
> But they were not trained on government-sanctioned homegrown EU data.

If none of the LLM makers used the very big corpus of EU multilingual data, I have an EU regulation bridge to sell you

tensor · 2 months ago
No, that's not how training works. It's not just about having an example in a given language, but also how many examples and the ratio of examples compared to other languages. English hugely eclipses any other language on most US models, and that's why performance on other languages is subpar compared to performance on English.
Byamarro · 2 months ago
There's actually research showing that LLMs are more accurate when questions are in Polish: https://arxiv.org/pdf/2503.01996
andy12_ · 2 months ago
I have never noticed any major difference in performance of ChatGPT between English and Spanish. The truth is that as long as the amount of training data of a given language is above some threshold, knowledge transfers between languages.
voxgen · 2 months ago
Ratio/quantity is important, but quality is even more so.

In recent LLMs, filtered internet text is at the low end of the quality spectrum. The higher end is curated scientific papers, synthetic and rephrased text, RLHF conversations, reasoning CoTs, etc. English/Chinese/Python/JavaScript dominate here.

The issue is that when there's a difference in training data quality between languages, LLMs likely associate that difference with the languages if not explicitly compensated for.

IMO it would be far more impactful to generate and publish high-quality data for minority languages for current model trainers, than to train new models that are simply enriched with a higher percentage of low-quality internet scrapings for the languages.

charlieyu1 · 2 months ago
Training is a very different thing. Can’t speak for European languages, but LLMs are often much worse in Japanese because tokenisation uses Unicode and a single Japanese character often has to be represented by more than one token
unscaled · 2 months ago
I think you meant to say that tokenization is usually done with UTF-8 and a single Japanese character generally takes 3 or more code units (i.e. bytes). Unicode itself is not the culprit (in fact, even with UTF-16 tokenization, most Japanese characters would fit in a single code unit, and the ones that won't are exceedingly rare).

I have to admit I have not encountered significant mistokenization issues in Japanese, but I'm not using LLMs in Japanese on a daily basis. I'm somewhat doubtful this can be a major issue, since frontier LLMs are absolutely in love with emoji, and most emoji require 4 UTF-8 bytes, while most Japanese characters are happy with just 3 bytes.
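
To make the byte counts above concrete, here is a minimal sketch that counts UTF-8 bytes and UTF-16 code units for a few sample characters. The helper name `code_units` and the choice of samples are mine, purely for illustration; this is not how any particular tokenizer is implemented. Note that for a byte-level BPE tokenizer the byte count is only an upper bound on tokens per character, since frequent byte sequences are normally merged into a single token.

```python
def code_units(ch: str) -> dict:
    """Count how many UTF-8 bytes and UTF-16 code units one character needs.

    Hypothetical helper for illustration only.
    """
    return {
        "utf8_bytes": len(ch.encode("utf-8")),
        # utf-16-le avoids the 2-byte BOM that plain "utf-16" would prepend
        "utf16_units": len(ch.encode("utf-16-le")) // 2,
    }


samples = {
    "Latin a": "a",
    "Latin e-acute (U+00E9)": "\u00e9",
    "Kanji (U+65E5)": "\u65e5",
    "Hiragana (U+3042)": "\u3042",
    "Rare kanji outside the BMP (U+20BB7)": "\U00020bb7",
    "Emoji (U+1F600)": "\U0001f600",
}

for label, ch in samples.items():
    print(f"{label}: {code_units(ch)}")
```

Under these assumptions the common kana and kanji come out at 3 UTF-8 bytes and 1 UTF-16 code unit, while the emoji and the rare kanji need 4 bytes and a surrogate pair.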

intended · 2 months ago
Nope. Capability begins to degrade once you move away from English.

Plus all your T&S/AI Safety is not solved with translation, you need lexicons and data sets of examples.

Like, people use someone in Malaysia, to label the Arabic spoken by someone playing a video game in Doha - the cultural context is missing.

The best proxy to show the degree of lopsidedness was from this : https://cdt.org/insights/lost-in-translation-large-language-...

Which in turn had to base it on this: https://stats.aclrollingreview.org/submissions/linguistic-di...

From what I am aware of, LLM capability degrades once you move out of English, and many nation states are either building, or considering the option of building their own LLMs.

numpad0 · 2 months ago
Not natively, they all sound translated in languages other than English. I occasionally come across French people complaining about LLMs' use of non-idiomatic French, but it's probably not a French problem at all, considering that this effort includes so many Indo-European languages.
FinnKuhn · 2 months ago
I can at least also confirm this for German. Here is one example that is quite annoying:

ChatGPT, for example, tends to start emails with "ich hoffe, es geht dir gut!", which means "I hope you are well!". In English (especially American) corporate emails this is a really common way to start an email. In German it is not, as "how are you" isn't a common phrase used here.

ideasarecool · 2 months ago
The term "support" is vague. Can you do basic interaction in most other languages? Sure. Is it anywhere close to the competence it has in English? No. Most models seem to just translate English responses at a beginner's simplistic, monotone level.
whazor · 2 months ago
European governments have huge collections of digitalised books, research, public data.

But also European culture could maybe make a difference? You can already see big differences between Grok and ChatGPT in terms of values.

pembrook · 2 months ago
If it's publicly available data, books and research, I can assure you the big models have already all been trained on it.

European culture is already embedded in all the models, unless the people involved in this project have some hidden trove of private data that they're training on which diverges drastically from things Europeans have published publicly (I'm 99.9% positive they don't...especially given Europe's alarmist attitude around anything related to data).

I think people don't understand a huge percentage of the employees at OpenAI, Anthropic, etc. are non-US born.

lm28469 · 2 months ago
Meh, it depends a lot on the datasets, which are heavily skewed towards the main languages. For example, they almost always confuse Czech and Slovak and often swap one for the other in the middle of chats
mirekrusin · 2 months ago
But the only way to unskew it is to remove main language data because there isn't really any to add, no?
RobotToaster · 2 months ago
Aren't they about as different as American English and British English?

adt · 2 months ago
The EuroLLM-9B model release is from Dec/2024, and scores just above random chance for benchmarks like MMLU-Pro (17.6%, random chance is 10%).

Comparison with similar EU models + 600 other highlights:

https://lifearchitect.ai/models-table/
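
For context on the 10% figure, here is a rough sketch of the arithmetic, assuming every MMLU-Pro question has ten answer options (some have fewer, so the true chance baseline is slightly higher). The function names are illustrative, not from any evaluation library.

```python
def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing on a multiple-choice item."""
    return 1.0 / num_options


def chance_normalized(accuracy: float, num_options: int) -> float:
    """Rescale accuracy so that random guessing maps to 0 and a perfect score to 1."""
    chance = chance_accuracy(num_options)
    return (accuracy - chance) / (1.0 - chance)


mmlu_pro_options = 10   # assumption: ten options per question
reported_score = 0.176  # the 17.6% quoted above

print(f"chance baseline: {chance_accuracy(mmlu_pro_options):.1%}")
print(f"chance-normalized score: {chance_normalized(reported_score, mmlu_pro_options):.1%}")
```

Rescaled this way, 17.6% raw accuracy corresponds to a bit over 8% of the distance between guessing and a perfect score.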

hebejebelus · 2 months ago
Some cursory clicking about didn't reveal to me the actual corpus they used, only that it is several trillion tokens 'divided across the languages'. I'm curious mainly because for Irish (among some other similarly endangered languages on the list), any large corpus typically comes from legal/governmental texts that are required to be translated. There must surely be only a relatively tiny amount of colloquial Irish in the corpus. It would be interesting to see some evals in each language, particularly with native speakers.

I think LLMs may be on the whole very positive for endangered languages such as Irish, but before it becomes positive I think there's an amount of danger to be navigated (see Scots Gaelic wikipedia drama for example)

In any case I think this is a great initiative.

tadzikpk · 2 months ago
That tracks. I learnt Gaeilge Uladh growing up and standard Irish feels like reading or writing a legal agreement compared to the spoken word…
Timwi · 2 months ago
Can you provide a link about the “Scots Gaelic Wikipedia drama” you reference? I've heard of drama related to the Scots Wikipedia but that has nothing to do with Gaelic.
hebejebelus · 2 months ago
My apologies, it was the Scots Wikipedia, careless of me. Link for context: https://en.wikipedia.org/wiki/Scots_Wikipedia#Controversy
srameshc · 2 months ago
I was thinking the same: why are so many superior models coming only from countries like the US and China? And why are European countries not on the list, other than France with Mistral? Why are so few companies in India, Japan, or South Korea even close to a promising new model like what Chinese companies did?
nonethewiser · 2 months ago
"Why" is a fair question but are you surprised? Europe is consistently behind in tech.

Europe has about 1.3 times the population of the USA and about 75% of the GDP yet EU tech output is a very small percentage of US tech output. We are not talking about 70, 50, 30, or even 20%. It's a drop in the bucket.

>The seven largest U.S. tech companies, Alphabet (Google), Amazon, Apple, Meta, Microsoft, Nvidia, and Tesla, are 20 times bigger than Europe’s seven largest, and generate 10 times more revenue.

https://eqtgroup.com/thinq/technology/why-is-europes-tech-in...

"Why" is a good question, but I definitely wouldn't expect significant competition in LLMs from Europe based on the giant tech disparity. Having 1 non-cutting-edge model that isn't really competitive is pretty much what I would expect.

InsideOutSanta · 2 months ago
> The seven largest U.S. tech companies (...) are 20 times bigger than Europe’s seven largest, and generate 10 times more revenue.

I'm going to guess that this part is intentional. Europe tends to be more aggressive in enforcing antitrust laws. Economically, Europe's goal isn't to have the biggest companies but to have more smaller companies.

So you're not going to get companies like Google, but you will get companies like Proton, Spotify, Tuta, Hetzner, Mistral, Threema, Filen, Babbel, Nextcloud, CryptPad, DeepL, Vivaldi, and so on.

emporas · 2 months ago
Also, commercial software is consistently behind open source.

I only use open source LLMs for writing (Qwen 32b from Groq) and an open source editor of course, Emacs.

If some people can write better using commercial LLMs (and commercial editors), by all means, but they put themselves at a disadvantage.

The next step for me is to use something open source for translation (I use Claude for the moment) and open source for programming (I use GPT currently). In less than a year I will find a satisfying solution to both of these problems. I haven't looked deep enough.

sublimefire · 2 months ago
As a European citizen I think it boils down to access to capital. The EU/EEA is not a country and the market is sort of fragmented. The big players are the UK, France and Germany; everyone else does not have the same access to money as, say, in the US. Folks want to do it but there is a glass ceiling. Hence you have these collabs among large institutions to tap into funds such as Horizon, which are academic in nature and do not translate well into products.
izacus · 2 months ago
The fun part is that people whining about not being able to raise common capital and operate across whole EU due to regulation tend to also be the most rabid opposers of any kind of common regulation that would bring whole EU into alignment and make it a less fragmented market.
loandbehold · 2 months ago
Because training a frontier model is expensive, and only the US and China have the capital structure to raise tens of billions of dollars to do it.
lossolo · 2 months ago
You can easily fit below 10 billion for the whole datacenter, then you only pay for electricity + maintenance + staff. 100k GPUs cost a few billion USD, that's more than enough to train frontier models, run experiments, and serve models in the EU to start. Look at what xAI did and how much it cost them and it's more expensive to do in US than in EU.
busssard · 2 months ago
being able to train new frontier models is the new equivalent to nuclear capabilities.

i predict at some point countries will get CIA'ed when they publish plans to build a large data center.

Similar to the time when they got CIA'ed when announcing plans for new nuclear plants.

sunaookami · 2 months ago
EU made a >900 page law about AI and patted themselves on the back for being "the first to regulate AI" (which was not even true, China had an AI law before and it's two pages long).
sajithdilshan · 2 months ago
This cannot be stressed enough. In my experience working in multiple tech startups in Germany, the power that compliance, legal and all other second-line functions have over engineering is quite immense. Most of the time they act as a hindrance to innovation rather than a supporting factor.

This AI law is a clear example of that. Pencil pushers creating more obstacles for the sake of creating more obstacles rather than actually taking a pragmatic approach.

isodev · 2 months ago
Because the value of these models is (actually) yet to be proven. Why saturate the market with something that we already have at least one of and others are selling as a service? No model provider (including the "big ones" like OpenAI) has been able to produce a viable business case. They're all literally running on government deals and investor money.
mensetmanusman · 2 months ago
It’s been proven that it is valuable to be able to convert English into executable code that does what is wanted.
apples_oranges · 2 months ago
Does it even make sense? Just use the American or Chinese ones and adjust as needed. Where’s the point in spending millions to build the same thing or worse?
t43562 · 2 months ago
Now that the big bets have been made, who wants to try to compete with them?
sireat · 2 months ago
It is interesting how much traction this 9B model is getting, which is good.

Still, two months earlier a model covering 19 European languages with 30B parameters got almost no mention:

https://huggingface.co/TildeAI/TildeOpen-30b

Mind you, that is another open model that is begging for fine-tuning (it is not very good out of the box).