For those curious, the 24 official languages are Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, and Swedish.
Maltese, interestingly, is the only Afro-Asiatic derived language.
Hungarian, Finnish, and Estonian are the three Uralic languages.
All the others are Indo-European: Greek is the only Hellenic one, Irish the only Celtic one; the rest are Baltic, Slavic, Italic, or Germanic.
(I originally used the term Balto-Slavic, though I was unaware of some of the connotations of that term until just now. Baltic and Slavic do share a common origin, but that was a very, very long time ago.)
Arabic, even. An outlier: as far as I know, it is the only Arabic dialect not written with the Arabic alphabet. It is also far removed from the other Arabic dialects.
As a Brit I feel very at home when hearing/reading Dutch and Frisian. It’s a reminder that England and the Low Countries share a lot of close history all the way back to Anglo-Saxon times; of being fishers, traders, burghers and mercenaries moving around the North Sea chasing opportunities, spreading and augmenting languages.
“Brea, bûter en griene tsiis is goed Ingelsk en goed Frysk”
Not sure what happened there but your link disproves your statement.
Specifically, the link says two things:
1. Two parties want to add *Limburgish* to the list, not Frisian. Limburg is the bottom-right part of the Netherlands, about as far removed from Friesland (the top part of the Netherlands) as you can get.
2. One party wants to add Frisian, but it is a mayfly party that will cease to exist in a few hours, as it will get 0 seats in this election and will presumably call it a day right after. It was a party founded to support one person, and that person has quit due to work stress and is highly unlikely to return, as this _was_ his return. Their opinion used to be relevant when they held 13.3% of the seats this past session (and didn't exist before it), but it isn't now.
Not a question, but Tatoeba could use your help! It is an open-source (both code and data) dataset of parallel sentences, and its Maltese data is very lacking. Also, it's pretty fun to just translate a bunch of random sentences into a language you speak. :-)
I recently discovered Maltese existed, and started learning it that day. I find it such an awesome language, and not just because of the letter Ħ
I do wonder what natives think and feel about the longevity of their language. What is taught in schools, and at what ages (assuming English is in the mix somewhere)? Is there enough media in Maltese to go about modern life fully in Maltese? It's shockingly hard to find any information on Maltese, and even harder to find content.
I’m not sure if it’s dying out, or in danger thereof; whether there are preservation efforts, or whether there is no need.
How are loan words viewed? Do businesses work in Maltese? Are monolingual speakers of the language regarded differently than those fluent in English? Do young people in Malta listen to Maltese music?
I'm actually really curious about everyday usage of the language; is code switching between English and Maltese more common than Maltese on its own? I've seen a few online communities where the vocabulary switches between Maltese and English very often which is interesting but I wonder how much of that is just online / written versus everyday speech.
How is "Marsaxlokk" really pronounced? I've heard that word a few times, but never from a native. Google translate can't help me here, as it doesn't seem to have Maltese text-to-speech.
I was thinking about separating the two groups when I was writing this but was afraid of getting too verbose, though in retrospect that probably would have made more sense regardless of the historical lineage. My apologies if this came off as inconsiderate.
I updated my original comment, and learned a good amount about that dispute as a result, so thanks for calling it out.
Cappadocian Greek is Greek heavily mixed with Turkish, to the extent that it is arguably better viewed as a distinct Hellenic language than as a nonstandard Greek dialect. However, around a century ago, most Greek speakers were expelled from Turkey and deported to Greece (and the same happened in reverse: most Turkish speakers in Greece were deported to Turkey), including almost all Cappadocian speakers. They and their descendants largely switched to standard Modern Greek, with the result that Cappadocian was long believed to have died out in the 1960s, although more recently small populations of Cappadocians have been found keeping the language alive in rural Greece.
Seems like the model isn't limited to those, though; from the paper:
> as well as some additional relevant languages (Arabic, Catalan, Chinese, Galician, Hindi, Japanese, Korean, Norwegian, Russian, Turkish, and Ukrainian).
The paper also goes into detail on the training-set sources; curating them might be considered the main contribution of this publication.
Flemish is more of a political construct than a linguistic one: it groups the Belgian-Dutch coastal, Brabantian, and Limburgish language groups, each with its own regional dialects.
I think those 24 reflect the languages that are official at the country level.
So for instance, Basque is not an official language of any country (only French in France and Spanish/Castilian in Spain). Belgium's official languages are French, Dutch, and German, "Flemish" is only a local variant of Dutch (Belgian French is also only a local variant of French).
Norwegian is not on this list because no country with Norwegian as its national language is part of the European Union at the time of writing.
"Norwegian" isn't just one unified language and Norway isn't in the EU.
That being said, the Scandinavian languages all come from Old Norse, and, modern national constructs aside, most of the people in those areas descend from the same mix of Germanic tribes. There's no denying that modern-day Danish, Norwegian, and Swedish are very similar.
English is an official EU language because Regulation 1 Article 1 says so [0] and hasn’t been changed. In practice, English is the most widely used language in EU institutions, so it would have been silly to remove it after Brexit.
AFAIK Ireland only listed Gaelic as their official language, with the UK having English. That caused a bit of a problem during Brexit, since technically English wasn't an official EU language anymore. I guess they resolved it somehow.
>The EuroLLM Team brings together some of the brightest minds in AI including Unbabel, Instituto Tecnico Lisbon, the University of Edinburgh, Instituto de Telecommunicacoes, Université Paris-Saclay, Aveni, Sorbonne University, Naver Labs, and the University of Amsterdam.
>Europe is the only continent in the world to have a large public network of supercomputers that are managed by the EuroHPC Joint Undertaking (EuroHPC JU). As soon as we received the EuroHPC JU access to the supercomputer, we were ready to roll up our sleeves and get to work. We developed the small model right away and in less than 6 months the second model was ready.
>Europe is the only continent in the world to have a large public network of supercomputers that are managed by the EuroHPC Joint Undertaking (EuroHPC JU).
Who would have thought that Europe is the only continent to have a network of supercomputers managed by Europe⸮
Aren't all frontier models already able to use all these languages? Support for specific languages doesn't need to be built in; LLMs support all languages because they are trained on multilingual data.
No, that's not how training works. It's not just about having an example in a given language, but also how many examples there are and their ratio relative to other languages. English hugely eclipses every other language in most US models, and that's why performance in other languages is subpar compared to performance in English.
I have never noticed any major difference in ChatGPT's performance between English and Spanish. The truth is that, as long as the amount of training data in a given language is above some threshold, knowledge transfers between languages.
Ratio/quantity is important, but quality is even more so.
In recent LLMs, filtered internet text is at the low end of the quality spectrum. The higher end is curated scientific papers, synthetic and rephrased text, RLHF conversations, reasoning CoTs, etc. English/Chinese/Python/JavaScript dominate here.
The issue is that when there's a difference in training-data quality between languages, LLMs likely associate that difference with the languages themselves unless it is explicitly compensated for.
IMO it would be far more impactful to generate and publish high-quality data in minority languages for current model trainers than to train new models that are simply enriched with a higher percentage of low-quality internet scrapings in those languages.
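As a sketch of the ratio point discussed above: multilingual trainers often rebalance corpora with temperature-based sampling (used, for example, in XLM-R), raising each language's corpus share to a power alpha < 1 so low-resource languages are upsampled. A minimal illustration in Python; the token counts are made up for the example:

```python
# Temperature-based sampling: p_i proportional to (n_i / N) ** alpha.
# With alpha < 1, low-resource languages get a larger sampling share
# than their raw fraction of the corpus.
def sampling_probs(token_counts, alpha=0.3):
    total = sum(token_counts.values())
    weights = {lang: (n / total) ** alpha for lang, n in token_counts.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

# Hypothetical corpus: English dwarfs Maltese by a factor of 1000.
counts = {"en": 1_000_000_000, "de": 100_000_000, "mt": 1_000_000}
probs = sampling_probs(counts)
# Maltese's sampling probability ends up far above its raw ~0.09% share,
# while the relative ordering of the languages is preserved.
```

This only rebalances quantity, of course; it does nothing about the quality gap between languages that the comment above points out.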
Training is a very different thing. I can’t speak for European languages, but LLMs are often much worse in Japanese because tokenisation uses Unicode and a single Japanese character often has to be represented by more than one token.
I think you meant to say that tokenization is usually done with UTF-8 and a single Japanese character generally takes 3 or more code units (i.e. bytes). Unicode itself is not the culprit (in fact, even with UTF-16 tokenization, most Japanese characters would fit in a single code unit, and the ones that won't are exceedingly rare).
I have to admit I have not encountered significant mistokenization issues in Japanese, though I'm not using LLMs in Japanese on a daily basis. I'm somewhat doubtful this can be a major issue, since frontier LLMs are absolutely in love with emoji, and an emoji requires at least 4 UTF-8 bytes, while most Japanese characters are happy with just 3.
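The byte counts in this exchange are easy to verify; a quick Python check (the characters are arbitrary examples):

```python
# UTF-8 encoded lengths: ASCII takes 1 byte, most Japanese kana and
# kanji take 3 bytes, and emoji outside the Basic Multilingual Plane
# take 4 bytes.
for ch in ["a", "の", "字", "😀"]:
    print(repr(ch), len(ch.encode("utf-8")))
# 'a' -> 1, 'の' -> 3, '字' -> 3, '😀' -> 4
```

Whether a byte-level tokenizer merges those bytes back into one token per character depends on the tokenizer's training data, which is where the imbalance argument comes in.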
From what I am aware of, LLM capability degrades once you move out of English, and many nation-states are either building, or considering building, their own LLMs.
Not natively; they all sound translated in languages other than English. I occasionally come across French people complaining about LLMs' use of non-idiomatic French, but it's probably not a French problem at all, considering that this effort includes so many Indo-European languages.
I can at least also confirm this for German. Here is one example that is quite annoying:
ChatGPT, for example, tends to start emails with "ich hoffe, es geht dir gut!", which means "I hope you are well!". In English (especially American) corporate emails this is a really common way to start an email. In German it is not, as "how are you" isn't a common phrase used here.
The term "support" is vague. Can you do basic interaction in most other languages? Sure. Is it anywhere close to the competence they have in English? No. Most models seem to just translate English responses at a beginner's simplistic, monotone level.
If it's publicly available data, books, and research, I can assure you the big models have already been trained on it.
European culture is already embedded in all the models, unless the people involved in this project have some hidden trove of private training data that diverges drastically from what Europeans have published publicly (I'm 99.9% positive they don't, especially given Europe's alarmist attitude around anything related to data).
I think people don't realize that a huge percentage of the employees at OpenAI, Anthropic, etc. are non-US-born.
Meh, it depends a lot on the dataset, which is heavily skewed towards the major languages. For example, they almost always confuse Czech and Slovak, and often swap one for the other in the middle of a chat.
Some cursory clicking around didn't reveal the actual corpus they used, only that it is several trillion tokens 'divided across the languages'. I'm curious mainly because for Irish (among other similarly endangered languages on the list), any large corpus typically comes from legal and governmental texts that are required to be translated. There must surely be only a relatively tiny amount of colloquial Irish in the corpus. It would be interesting to see some evals in each language, particularly with native speakers.
I think LLMs may on the whole be very positive for endangered languages such as Irish, but before it becomes positive I think there's an amount of danger to be navigated (see the Scots Gaelic Wikipedia drama, for example).
Can you provide a link about the “Scots Gaelic Wikipedia drama” you reference? I've heard of drama related to the Scots Wikipedia but that has nothing to do with Gaelic.
I was thinking the same: why are superior models coming only from countries like the US and China? Why are European countries not on the list, other than France with Mistral? And why are so few companies in India, Japan, or South Korea even close to a promising new model, like what the Chinese companies did?
"Why" is a fair question but are you surprised? Europe is consistently behind in tech.
Europe has about 1.3 times the population of the USA and about 75% of the GDP yet EU tech output is a very small percentage of US tech output. We are not talking about 70, 50, 30, or even 20%. It's a drop in the bucket.
>The seven largest U.S. tech companies, Alphabet (Google), Amazon, Apple, Meta, Microsoft, Nvidia, and Tesla, are 20 times bigger than Europe’s seven largest, and generate 10 times more revenue.
"Why" is a good question, but I definitely wouldn't expect significant competition in LLMs from Europe given the giant tech disparity. Having one non-cutting-edge model that isn't really competitive is pretty much what I would expect.
> The seven largest U.S. tech companies (...) are 20 times bigger than Europe’s seven largest, and generate 10 times more revenue.
I'm going to guess that this part is intentional. Europe tends to be more aggressive in enforcing antitrust laws. Economically, Europe's goal isn't to have the biggest companies but to have more smaller companies.
So you're not going to get companies like Google, but you will get companies like Proton, Spotify, Tuta, Hetzner, Mistral, Threema, Filen, Babbel, Nextcloud, CryptPad, DeepL, Vivaldi, and so on.
Also, commercial software is consistently behind open source.
I only use open-source LLMs for writing (Qwen 32b from Groq) and an open-source editor, of course: Emacs.
If some people can write better using commercial LLMs (and commercial editors), by all means, but they put themselves at a disadvantage.
The next step for me is to use something open source for translation (I use Claude for the moment) and for programming (I currently use GPT). In less than a year I will find a satisfying solution to both of these problems; I haven't looked deeply enough yet.
As a European citizen, I think it boils down to access to capital. The EU/EEA is not a country, and the market is somewhat fragmented. The big players are the UK, France, and Germany; everyone else does not have the same access to money as, say, in the US. Folks want to do it, but there is a glass ceiling. Hence these collaborations among large institutions to tap into funds such as Horizon, which are academic in nature and do not translate well into products.
The fun part is that the people whining about not being able to raise common capital and operate across the whole EU due to regulation also tend to be the most rabid opponents of any kind of common regulation that would bring the whole EU into alignment and make it a less fragmented market.
You can easily stay below 10 billion for the whole datacenter; after that you only pay for electricity, maintenance, and staff. 100k GPUs cost a few billion USD, and that's more than enough to train frontier models, run experiments, and serve models in the EU to start. Look at what xAI did and how much it cost them, and it's more expensive to do in the US than in the EU.
The EU made a >900-page law about AI and patted themselves on the back for being "the first to regulate AI" (which wasn't even true: China had an AI law before, and it's two pages long).
This cannot be stressed enough. In my experience working in multiple tech startups in Germany, the power that compliance, legal, and all the other second-line functions have over engineering is quite immense. Most of the time they act as a hindrance to innovation rather than a supporting factor.
This AI law is a clear example of that. Pencil pushers creating more obstacles for the sake of creating more obstacles rather than actually taking a pragmatic approach.
Because the value of these models is (actually) yet to be proven. Why saturate the market with something we already have at least one of, and that others are selling as a service? No model provider (including the "big ones" like OpenAI) has been able to produce a viable business case. They're all literally running on government deals and investor money.
Does it even make sense? Just use the American or Chinese ones and adjust as needed. Where's the point in spending millions to build the same thing, or worse?
It's Semitic, to be precise.
https://en.wikipedia.org/wiki/Semitic_languages
Best get to retraining those models.
They get a certain recognition, but they are not official EU languages. For example, from Spain alone there would be 13 languages on that list.
https://tatoeba.org/
How much do you consider Maltese its own language (as opposed to a dialect of Arabic)?
Are there really any other Hellenic languages besides Greek?
https://arxiv.org/pdf/2409.16235
What about Basque? Is that too controversial?
[0] https://en.wikipedia.org/wiki/Hotel_Beau_Séjour
I have often joked that Norwegian is just a dialect of Swedish, but I never expected to get official validation like this!
https://en.wikipedia.org/wiki/Graecians
[0] https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:01...
PS: Gaelic is a more general term covering both Irish and Scottish. Ireland specifically brings the Irish language (Gaeilge in Irish).
Arabic, Catalan, Chinese, Galician, Hindi, Japanese, Korean, Norwegian, Russian, Turkish, and Ukrainian.
https://huggingface.co/utter-project/EuroLLM-9B
0: https://sites.google.com/view/eurollm/home
[1] https://www.eurohpc-ju.europa.eu/eurohpc-success-story-speak...
Repurposing some of that physics sim compute
But they were not trained on government-sanctioned homegrown EU data.
OK, what are you implying with this?
If none of the LLM makers used the very big corpus of EU multilingual data, I have an EU-regulation bridge to sell you.
Plus, all your T&S/AI-safety work is not solved with translation; you need lexicons and datasets of examples.
Like, people use someone in Malaysia to label the Arabic spoken by someone playing a video game in Doha; the cultural context is missing.
The best proxy for showing the degree of lopsidedness was this: https://cdt.org/insights/lost-in-translation-large-language-...
Which in turn had to base it on this: https://stats.aclrollingreview.org/submissions/linguistic-di...
But could European culture maybe also make a difference? You can already see big differences between Grok and ChatGPT in terms of values.
Comparison with similar EU models + 600 other highlights:
https://lifearchitect.ai/models-table/
In any case I think this is a great initiative.
https://eqtgroup.com/thinq/technology/why-is-europes-tech-in...
I predict that at some point countries will get CIA'ed when they publish plans to build a large data center.
Similar to the time they got CIA'ed when announcing plans for new nuclear plants.
Still, two months earlier a 19-European-language model with 30B parameters got almost no mention:
https://huggingface.co/TildeAI/TildeOpen-30b
Mind you, that is another open model that is begging for fine-tuning (it is not very good out of the box).