For those curious, the 24 official languages are Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, and Swedish.
Maltese, interestingly, is the only Afro-Asiatic derived language.
Hungarian, Finnish, and Estonian are the three Uralic languages.
All the others are Indo-European: Greek is the only Hellenic one, Irish the only Celtic one; the rest are Baltic, Slavic, Italic, or Germanic.
(I originally used the term Balto-Slavic, though I was unaware of some of the connotations of that term until just now. Baltic and Slavic do share a common origin, but that was a very, very long time ago.)
Arabic, even. An outlier: as far as I know, it is the only Arabic dialect not written with the Arabic alphabet. It is also far removed from the other Arabic dialects.
As a Brit I feel very at home when hearing/reading Dutch and Frisian. It’s a reminder that England and the Low Countries share a lot of close history all the way back to Anglo-Saxon times; of being fishers, traders, burghers and mercenaries moving around the North Sea chasing opportunities, spreading and augmenting languages.
“Brea, bûter en griene tsiis is goed Ingelsk en goed Frysk”
Not sure what happened there but your link disproves your statement.
Specifically, the link says two things:
1. Two parties want to add *Limburgish* to the list, not Frisian. Limburg is the bottom-right part of the Netherlands, about as far removed from Friesland (the top part of the Netherlands) as you can get.
2. One party wants to add Frisian, but it is a mayfly party that will cease to exist in a few hours, as it will get 0 seats in this election and will presumably call it a day right after. It was a party founded to support one person, and that person has quit due to work stress and is highly unlikely to return, as this _was_ his return. Their opinion used to be relevant when they held 13.3% of the seats this past session (and didn't exist before it), but it isn't now.
Not a question, but Tatoeba could use your help! It is an open-source (both code and data) dataset of parallel sentences, and its Maltese data is very lacking. Also, it's pretty fun to just translate a bunch of random sentences into a language you speak. :-)
I recently discovered Maltese existed, and started learning it that day. I find it such an awesome language, and not just because of the letter Ħ
I do wonder what natives think and feel about the longevity of their language. What is taught in schools, and at what ages (assuming English is in the mix somewhere)? Is there enough media in Maltese to go about modern life fully in Maltese? It's shockingly hard to find any information on Maltese, and even harder to find content.
I’m not sure if it’s dying out, or in danger thereof; whether there are preservation efforts, or whether there is no need.
How are loan words viewed? Do businesses work in Maltese? Are monolingual speakers of the language regarded differently than those fluent in English? Do young people in Malta listen to Maltese music?
I'm actually really curious about everyday usage of the language; is code switching between English and Maltese more common than Maltese on its own? I've seen a few online communities where the vocabulary switches between Maltese and English very often which is interesting but I wonder how much of that is just online / written versus everyday speech.
How is "Marsaxlokk" really pronounced? I've heard that word a few times, but never from a native. Google translate can't help me here, as it doesn't seem to have Maltese text-to-speech.
I was thinking about separating the two groups when I was writing this but was afraid of getting too verbose, though in retrospect that probably would have made more sense regardless of the historical lineage. My apologies if this came off as inconsiderate.
I updated my original comment, and learned a good amount about that dispute as a result, so thanks for calling it out.
Cappadocian Greek is Greek heavily mixed with Turkish, to the extent that it is arguably better viewed as a distinct Hellenic language than as a nonstandard Greek dialect. However, around a century ago, most Greek speakers were expelled from Turkey and deported to Greece (and the same happened in reverse: most Turkish speakers in Greece were deported to Turkey), including almost all Cappadocian speakers. They and their descendants largely switched to standard Modern Greek, with the result that Cappadocian was long believed to have died out in the 1960s, although more recently small populations of Cappadocians have been found keeping the language alive in rural Greece.
Seems like the model isn't limited to those, though; from the paper:
> as well as some additional relevant languages (Arabic, Catalan, Chinese, Galician, Hindi, Japanese, Korean, Norwegian, Russian, Turkish, and Ukrainian).
The paper also goes into detail on the training-set sources; curating them might be considered the main contribution of this publication.
Flemish is more of a political construct than a linguistic one: it groups the Belgian-Dutch coastal, Brabantian, and Limburgish language groups, each with its own regional dialects.
I think those 24 reflect the languages that are official at the country level.
So for instance, Basque is not an official language of any country (only French in France and Spanish/Castilian in Spain). Belgium's official languages are French, Dutch, and German, "Flemish" is only a local variant of Dutch (Belgian French is also only a local variant of French).
Norwegian is not on this list because no country with Norwegian as its national language is part of the European Union at the time of writing.
"Norwegian" isn't just one unified language and Norway isn't in the EU.
That being said, the Scandinavian languages all come from Old Norse, and, modern national constructs aside, most of the people in those areas descend from the same mix of Germanic tribes. There's no denying that modern-day Danish, Norwegian, and Swedish are very similar.
English is an official EU language because Regulation 1 Article 1 says so [0] and hasn’t been changed. In practice, English is the most widely used language in EU institutions, so it would have been silly to remove it after Brexit.
AFAIK Ireland only listed Gaelic as their official language, with the UK having English. That caused a bit of a problem during Brexit, since technically English wasn't an official EU language anymore. I guess they resolved it somehow.
>The EuroLLM Team brings together some of the brightest minds in AI including Unbabel, Instituto Tecnico Lisbon, the University of Edinburgh, Instituto de Telecommunicacoes, Université Paris-Saclay, Aveni, Sorbonne University, Naver Labs, and the University of Amsterdam.
>Europe is the only continent in the world to have a large public network of supercomputers that are managed by the EuroHPC Joint Undertaking (EuroHPC JU). As soon as we received the EuroHPC JU access to the supercomputer, we were ready to roll up our sleeves and get to work. We developed the small model right away and in less than 6 months the second model was ready.
>Europe is the only continent in the world to have a large public network of supercomputers that are managed by the EuroHPC Joint Undertaking (EuroHPC JU).
Who would have thought that Europe is the only continent to have a network of supercomputers managed by Europe⸮
Aren't all frontier models already able to use all these languages? Support for specific languages doesn't need to be built in; LLMs support all languages because they are trained on multilingual data.
No, that's not how training works. It's not just about having an example in a given language, but also how many examples there are and their ratio relative to other languages. English hugely eclipses every other language in most US models, and that's why performance in other languages is subpar compared to performance in English.
I have never noticed any major difference in ChatGPT's performance between English and Spanish. The truth is that, as long as the amount of training data in a given language is above some threshold, knowledge transfers between languages.
Ratio/quantity is important, but quality is even more so.
In recent LLMs, filtered internet text is at the low end of the quality spectrum. The higher end is curated scientific papers, synthetic and rephrased text, RLHF conversations, reasoning CoTs, etc. English/Chinese/Python/JavaScript dominate here.
The issue is that when there's a difference in training-data quality between languages, LLMs likely associate that difference with the languages themselves unless it is explicitly compensated for.
IMO it would be far more impactful to generate and publish high-quality data in minority languages for current model trainers than to train new models that are simply enriched with a higher percentage of low-quality internet scrapings in those languages.
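As a sketch of the ratio point discussed above: multilingual trainers often rebalance corpora with temperature-based sampling (used, for example, in XLM-R), raising each language's corpus share to a power alpha < 1 so low-resource languages are upsampled. A minimal illustration in Python; the token counts are made up for the example:

```python
# Temperature-based sampling: p_i proportional to (n_i / N) ** alpha.
# With alpha < 1, low-resource languages get a larger sampling share
# than their raw fraction of the corpus.
def sampling_probs(token_counts, alpha=0.3):
    total = sum(token_counts.values())
    weights = {lang: (n / total) ** alpha for lang, n in token_counts.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

# Hypothetical corpus: English dwarfs Maltese by a factor of 1000.
counts = {"en": 1_000_000_000, "de": 100_000_000, "mt": 1_000_000}
probs = sampling_probs(counts)
# Maltese's sampling probability ends up far above its raw ~0.09% share,
# while the relative ordering of the languages is preserved.
```

This only rebalances quantity, of course; it does nothing about the quality gap between languages that the comment above points out.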
Training is a very different thing. I can’t speak for European languages, but LLMs are often much worse in Japanese because tokenisation uses Unicode and a single Japanese character often has to be represented by more than one token.
I think you meant to say that tokenization is usually done with UTF-8 and a single Japanese character generally takes 3 or more code units (i.e. bytes). Unicode itself is not the culprit (in fact, even with UTF-16 tokenization, most Japanese characters would fit in a single code unit, and the ones that won't are exceedingly rare).
I have to admit I have not encountered significant mistokenization issues in Japanese, though I'm not using LLMs in Japanese on a daily basis. I'm somewhat doubtful this can be a major issue, since frontier LLMs are absolutely in love with emoji, and an emoji requires at least 4 UTF-8 bytes, while most Japanese characters are happy with just 3.
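The byte counts in this exchange are easy to verify; a quick Python check (the characters are arbitrary examples):

```python
# UTF-8 encoded lengths: ASCII takes 1 byte, most Japanese kana and
# kanji take 3 bytes, and emoji outside the Basic Multilingual Plane
# take 4 bytes.
for ch in ["a", "の", "字", "😀"]:
    print(repr(ch), len(ch.encode("utf-8")))
# 'a' -> 1, 'の' -> 3, '字' -> 3, '😀' -> 4
```

Whether a byte-level tokenizer merges those bytes back into one token per character depends on the tokenizer's training data, which is where the imbalance argument comes in.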
From what I am aware of, LLM capability degrades once you move out of English, and many nation-states are either building, or considering building, their own LLMs.
Not natively; they all sound translated in languages other than English. I occasionally come across French people complaining about LLMs' use of non-idiomatic French, but it's probably not a French problem at all, considering that this effort includes so many Indo-European languages.
I can at least also confirm this for German. Here is one example that is quite annoying:
ChatGPT, for example, tends to start emails with "ich hoffe, es geht dir gut!", which means "I hope you are well!". In English (especially American) corporate emails this is a really common way to start an email. In German it is not, as "how are you" isn't a common phrase used here.
The term "support" is vague. Can you do basic interaction in most other languages? Sure. Is it anywhere close to the competence they have in English? No. Most models seem to just translate English responses at a beginner's simplistic, monotone level.
If it's publicly available data, books, and research, I can assure you the big models have already been trained on it.
European culture is already embedded in all the models, unless the people involved in this project have some hidden trove of private training data that diverges drastically from what Europeans have published publicly (I'm 99.9% positive they don't, especially given Europe's alarmist attitude around anything related to data).
I think people don't realize that a huge percentage of the employees at OpenAI, Anthropic, etc. are non-US-born.
Meh, it depends a lot on the dataset, which is heavily skewed towards the major languages. For example, they almost always confuse Czech and Slovak, and often swap one for the other in the middle of a chat.
Some cursory clicking around didn't reveal the actual corpus they used, only that it is several trillion tokens 'divided across the languages'. I'm curious mainly because for Irish (among other similarly endangered languages on the list), any large corpus typically comes from legal and governmental texts that are required to be translated. There must surely be only a relatively tiny amount of colloquial Irish in the corpus. It would be interesting to see some evals in each language, particularly with native speakers.
I think LLMs may on the whole be very positive for endangered languages such as Irish, but before it becomes positive I think there's an amount of danger to be navigated (see the Scots Gaelic Wikipedia drama, for example).
Can you provide a link about the “Scots Gaelic Wikipedia drama” you reference? I've heard of drama related to the Scots Wikipedia but that has nothing to do with Gaelic.
I was thinking the same: why are superior models coming only from countries like the US and China? Why are European countries not on the list, other than France with Mistral? And why are so few companies in India, Japan, or South Korea even close to a promising new model, like what the Chinese companies did?
"Why" is a fair question but are you surprised? Europe is consistently behind in tech.
Europe has about 1.3 times the population of the USA and about 75% of the GDP yet EU tech output is a very small percentage of US tech output. We are not talking about 70, 50, 30, or even 20%. It's a drop in the bucket.
>The seven largest U.S. tech companies, Alphabet (Google), Amazon, Apple, Meta, Microsoft, Nvidia, and Tesla, are 20 times bigger than Europe’s seven largest, and generate 10 times more revenue.
"Why" is a good question, but I definitely wouldn't expect significant competition in LLMs from Europe given the giant tech disparity. Having one non-cutting-edge model that isn't really competitive is pretty much what I would expect.
> The seven largest U.S. tech companies (...) are 20 times bigger than Europe’s seven largest, and generate 10 times more revenue.
I'm going to guess that this part is intentional. Europe tends to be more aggressive in enforcing antitrust laws. Economically, Europe's goal isn't to have the biggest companies but to have more smaller companies.
So you're not going to get companies like Google, but you will get companies like Proton, Spotify, Tuta, Hetzner, Mistral, Threema, Filen, Babbel, Nextcloud, CryptPad, DeepL, Vivaldi, and so on.
Also, commercial software is consistently behind open source.
I only use open-source LLMs for writing (Qwen 32b from Groq) and an open-source editor, of course: Emacs.
If some people can write better using commercial LLMs (and commercial editors), by all means, but they put themselves at a disadvantage.
The next step for me is to use something open source for translation (I use Claude for the moment) and for programming (I currently use GPT). In less than a year I will find a satisfying solution to both of these problems; I haven't looked deeply enough yet.
As a European citizen, I think it boils down to access to capital. The EU/EEA is not a country, and the market is somewhat fragmented. The big players are the UK, France, and Germany; everyone else does not have the same access to money as, say, in the US. Folks want to do it, but there is a glass ceiling. Hence these collaborations among large institutions to tap into funds such as Horizon, which are academic in nature and do not translate well into products.
The fun part is that the people whining about not being able to raise common capital and operate across the whole EU due to regulation also tend to be the most rabid opponents of any kind of common regulation that would bring the whole EU into alignment and make it a less fragmented market.
You can easily stay below 10 billion for the whole datacenter; after that you only pay for electricity, maintenance, and staff. 100k GPUs cost a few billion USD, and that's more than enough to train frontier models, run experiments, and serve models in the EU to start. Look at what xAI did and how much it cost them, and it's more expensive to do in the US than in the EU.
The EU made a >900-page law about AI and patted themselves on the back for being "the first to regulate AI" (which wasn't even true: China had an AI law before, and it's two pages long).
This cannot be stressed enough. In my experience working in multiple tech startups in Germany, the power that compliance, legal, and all the other second-line functions have over engineering is quite immense. Most of the time they act as a hindrance to innovation rather than a supporting factor.
This AI law is a clear example of that. Pencil pushers creating more obstacles for the sake of creating more obstacles rather than actually taking a pragmatic approach.
Because the value of these models is (actually) yet to be proven. Why saturate the market with something we already have at least one of, and that others are selling as a service? No model provider (including the "big ones" like OpenAI) has been able to produce a viable business case. They're all literally running on government deals and investor money.
Does it even make sense? Just use the American or Chinese ones and adjust as needed. Where's the point in spending millions to build the same thing, or worse?
It's Semitic, to be precise.
https://en.wikipedia.org/wiki/Semitic_languages
Best get to retraining those models.
They get a certain recognition, but they are not official EU languages. For example, from Spain alone there would be 13 languages on that list.
https://tatoeba.org/
How much do you consider Maltese its own language (as opposed to a dialect of Arabic)?
Are there really any other Hellenic languages besides Greek?
https://arxiv.org/pdf/2409.16235
What about Basque? Is that too controversial?
[0] https://en.wikipedia.org/wiki/Hotel_Beau_Séjour
I have often joked that Norwegian is just a dialect of Swedish, but I never expected to get official validation like this!
https://en.wikipedia.org/wiki/Graecians
[0] https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:01...
PS: Gaelic is a more general term covering both Irish and Scottish. Ireland specifically brings the Irish language (Gaeilge in Irish).
Arabic, Catalan, Chinese, Galician, Hindi, Japanese, Korean, Norwegian, Russian, Turkish, and Ukrainian.
https://huggingface.co/utter-project/EuroLLM-9B
0: https://sites.google.com/view/eurollm/home
[1] https://www.eurohpc-ju.europa.eu/eurohpc-success-story-speak...
Repurposing some of that physics sim compute
But they were not trained on government-sanctioned homegrown EU data.
OK, what are you implying with this?
If none of the LLM makers used the very big corpus of EU multilingual data, I have an EU-regulation bridge to sell you.
Plus, all your T&S/AI-safety work is not solved with translation; you need lexicons and datasets of examples.
Like, people use someone in Malaysia to label the Arabic spoken by someone playing a video game in Doha; the cultural context is missing.
The best proxy for showing the degree of lopsidedness was this: https://cdt.org/insights/lost-in-translation-large-language-...
Which in turn had to base it on this: https://stats.aclrollingreview.org/submissions/linguistic-di...
But could European culture maybe also make a difference? You can already see big differences between Grok and ChatGPT in terms of values.
Comparison with similar EU models + 600 other highlights:
https://lifearchitect.ai/models-table/
In any case I think this is a great initiative.
https://eqtgroup.com/thinq/technology/why-is-europes-tech-in...
I predict that at some point countries will get CIA'ed when they publish plans to build a large data center.
Similar to the time they got CIA'ed when announcing plans for new nuclear plants.
Still, two months earlier a 19-European-language model with 30B parameters got almost no mention:
https://huggingface.co/TildeAI/TildeOpen-30b
Mind you, that is another open model that is begging for fine-tuning (it is not very good out of the box).