Readit News
ftyers commented on Common Voice   commonvoice.mozilla.org/e... · Posted by u/oblib
vidarh · 2 years ago
Weird to hold off on adding a language because the UI isn't translated. Why would there be an assumption that the language people want to record is linked to preferred UI language?

I don't want Norwegian UI - I just want to be able to record Norwegian sentences. If the UI switches to Norwegian I'd be very annoyed, as I haven't indicated I want that and my browser settings specify English.

(I avoid Norwegian for UIs, because the translations are generally wildly inconsistent in how they translate key terms that I'm used to seeing in English, so it's a massive nuisance - when people assume UI and content language should be the same, that is a major failing to me)

Re: tags, someone else pointed out that the accent field is being used for this, even though the UI describes it as specifically for accents.

ftyers · 2 years ago
[comment removed]
ftyers commented on Common Voice   commonvoice.mozilla.org/e... · Posted by u/oblib
vidarh · 2 years ago
I submitted a request for Norwegian Bokmål, and realised a complication which I'm sure must affect other languages too:

Norway has two separate official languages. They are unusually close - one is relatively close to Danish, and the other started as a collection of dialects, but technically they are written languages, especially Bokmål which basically means "book language".

I'm unusual in that I speak close to "pure" Bokmål. Thanks to expectations at school etc., a lot of speakers who write Bokmål will adjust or tone down their dialect if asked to read a text that is written in grammatically and orthographically correct Bokmål, but will otherwise speak in a manner that can deviate fairly significantly from the written language.

As such, depending on whether your goal is text to speech or speech recognition, the pronunciation you will need is very different.

E.g. people I know who write Bokmål might say something like "hva erredu ser på a?" ("what are you looking at?") with hardly any gaps between words, while I would stick close to the written "hva er det du ser på?" with clear gaps. In recognition you need to handle both (and many other variations), while for generation you'd at least by default usually want the latter unless there are indications the text is written in dialect.

It strikes me you'd really want people to write more detail about what it is they are speaking and/or let people tag/label data with additional info about accents. Not just for this, but for other multi-lingual speakers as well. E.g. it'd be helpful to have many foreign accents in the English (and other languages) dataset for recognition, but as much as I want speech recognition to understand me, I'm not particularly interested in teaching it to speak English with a strong Norwegian accent.

That is less of an issue than the dialects in some languages that can involve much more than just speaking the same words differently.

To take another example, "Jeg åpnet døren og gikk ut i solen" and "Jeg åpna døra og gikk ut i sola" are both valid Bokmål. Depending on context a reader may stick strictly to the text or swap åpnet<->åpna, døren<->døra, solen<->sola, and every permutation is valid. Which exact set you use differs and some speakers will write one but use the other when speaking. E.g. I would say åpna, døra, sola, but write åpnet, døren, solen. The latter is more formal and/or old-fashioned in some parts of the country, but the perception of that also varies by region. And this totally leaves out all the dialect variations used by people who'd say their language is Bokmål, and would be recognized as such by Norwegian speakers, but who use variants of words or conjugations that aren't technically recognized as valid Bokmål.

The former is more "modern" (several of the forms are only valid Bokmål as a result of successive language reforms), more common in the Eastern part of Norway outside of the posher parts of Oslo and other wealthy regions, and (weirdly) more common among 1970s radical left-wing academics (especially people involved with the Maoist Workers Communist Party/AKP-ML) as an affectation/sociolect, with each of these groups also deviating in other aspects....
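
To make the "every permutation is valid" point from the åpnet/åpna example concrete, here is a minimal Python sketch that enumerates the combinations. The template string and variable names are just for illustration; the word pairs are the three from the example.

```python
from itertools import product

# Each slot holds the two interchangeable forms from the example sentence.
slots = [("åpnet", "åpna"), ("døren", "døra"), ("solen", "sola")]
template = "Jeg {} {} og gikk ut i {}"

# Three independent two-way choices give 2 * 2 * 2 = 8 sentences.
variants = [template.format(*choice) for choice in product(*slots)]

for sentence in variants:
    print(sentence)
```

A single written sentence can therefore correspond to eight spoken readings before dialect forms are even considered, which is why a single "language" label says so little about a recording.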

If you want to maximize the utility of a dataset like this, you really would want to let each speaker at least assign a lot of tags/labels to their profile; even if you don't want to deal with the hornet nest of trying to figure out all the distinctions, even unstructured labels would be a start, and ideally allowing people to tag individual recordings as well, because there are a lot more variations than just "language" and "accent" here.

ftyers · 2 years ago
> If you want to maximize the utility of a dataset like this, you really would want to let each speaker at least assign a lot of tags/labels to their profile; even if you don't want to deal with the hornet nest of trying to figure out all the distinctions, even unstructured labels would be a start, and ideally allowing people to tag individual recordings as well, because there are a lot more variations than just "language" and "accent" here.

This is exactly what the freeform accent (actually "variant") field is. You can add as many tags as you like. https://foundation.mozilla.org/en/blog/how-we-are-making-com...

ftyers commented on Common Voice   commonvoice.mozilla.org/e... · Posted by u/oblib
yorwba · 2 years ago
> Norwegian Bokmål

... is currently in progress. What's missing is a sufficiently complete translation of the UI https://pontoon.mozilla.org/projects/common-voice/ and a sufficiently large number of sentences for people to record https://commonvoice.mozilla.org/nb-NO/write

> let each speaker at least assign a lot of tags/labels to their profile

Common Voice data files have columns for age, gender, accents, variant, locale and segment. (Not sure what that last one is.) These are per recording, but I'm pretty sure they're the same for all recordings by the same speaker.
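
A rough way to check the "same for all recordings by the same speaker" guess is to group rows by speaker and look for differing metadata. This Python sketch runs on invented sample rows; the speaker column name `client_id` and the `path` column are assumptions here, so adjust them to whatever the actual file header says.

```python
import csv
import io
from collections import defaultdict

# The six metadata columns named above.
METADATA_COLUMNS = ["age", "gender", "accents", "variant", "locale", "segment"]

# Invented rows for illustration only; "client_id" and "path" are assumed names.
HEADER = ["client_id", "path"] + METADATA_COLUMNS
ROWS = [
    ["spk1", "a.mp3", "thirties", "", "Oslo", "", "en", ""],
    ["spk1", "b.mp3", "thirties", "", "Oslo", "", "en", ""],
    ["spk2", "c.mp3", "", "", "", "", "en", ""],
    ["spk2", "d.mp3", "", "", "Bergen", "", "en", ""],
]
SAMPLE_TSV = "\n".join("\t".join(r) for r in [HEADER] + ROWS) + "\n"


def inconsistent_speakers(tsv_text, speaker_column="client_id"):
    """Return speakers whose metadata differs between their recordings."""
    seen = defaultdict(set)
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        seen[row[speaker_column]].add(tuple(row[c] for c in METADATA_COLUMNS))
    return sorted(s for s, combos in seen.items() if len(combos) > 1)


print(inconsistent_speakers(SAMPLE_TSV))
```

For a real file you would read the text from disk instead of `SAMPLE_TSV`; an empty result would support the guess that the metadata is per speaker.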

ftyers · 2 years ago
Target segment was a way of including specific subdatasets. For example the digits dataset which was just the digits 0-9 and yes/no.
ftyers commented on The diamond world takes radical steps to stop a pricing plunge   bloomberg.com/news/articl... · Posted by u/toomuchtodo
xivzgrev · 2 years ago
Guys buying diamonds can complain all day about what a rip-off it is (it is).

But if you apprise your fiancée of the alternatives / opportunity costs (moissanite, man-made, second-hand, etc.) and she still wants a new diamond, just get her a new diamond. It will make her happy for years.

My wife had her heart set on a pink diamond, so that’s what I got her. I made it clear though that if I got this, she is not getting a new one for a long time (she has a tendency to want to rent / sample / change things up). So to her credit she did her homework and spent a few months picking the best one.

Now every time we go out and she wears it, I can see how happy she is. It's worth it to me to see a gift keep giving.

ftyers · 2 years ago
That's what I thought... The price amortises over the length of the marriage, so it actually works out fairly reasonably.
ftyers commented on Codex documenting Aztec culture now digitized   hyperallergic.com/855683/... · Posted by u/diodorus
nathancahill · 2 years ago
That's really cool. I went deep into epigraphy when I lived in Guatemala and was regularly finding carved pottery in our garden beds. Spent a lot of time annotating paper copies of glyphs.

What's the way to view/edit the .conllu files?

ftyers · 2 years ago
They're generated from a tonne of Python scripts, but feel free to get in contact and I can take you through how it works! :)
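
For just viewing, .conllu files are plain text in the CoNLL-U format: `#` comment lines, one token per line with ten tab-separated columns, and a blank line between sentences. The following Python sketch is a minimal reader, not the project's own scripts; the sample sentence is invented and `read_conllu` is a hypothetical helper name.

```python
# The ten standard CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

# Invented two-token sentence, for illustration only.
SAMPLE = """# sent_id = demo-1
# text = Owls hoot
1\tOwls\towl\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_
2\thoot\thoot\tVERB\t_\t_\t0\troot\t_\t_

"""


def read_conllu(text):
    """Split CoNLL-U text into sentences with comments and token dicts."""
    sentences, comments, tokens = [], [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                sentences.append({"comments": comments, "tokens": tokens})
            comments, tokens = [], []
        elif line.startswith("#"):
            comments.append(line[1:].strip())
        else:
            tokens.append(dict(zip(FIELDS, line.split("\t"))))
    if tokens:  # handle a file without a trailing blank line
        sentences.append({"comments": comments, "tokens": tokens})
    return sentences


for sentence in read_conllu(SAMPLE):
    for tok in sentence["tokens"]:
        print(tok["id"], tok["form"], tok["upos"], tok["head"], tok["deprel"])
```

Since the files are tab-separated text, any text editor can also edit them, as long as the tabs and the blank lines between sentences are preserved.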
ftyers commented on Codex documenting Aztec culture now digitized   hyperallergic.com/855683/... · Posted by u/diodorus
ftyers · 2 years ago
The sad thing is that there isn't really anything new here. It's the Anderson and Dibble translation, and some random extra stuff. For 15 years' work it's quite a limited contribution. In addition, it's not freely licensed. I'm working on a free/open-source licensed edition with linguistic annotation. If anyone is interested, ask for the link, it's on GitHub.
ftyers · 2 years ago
It's also available online at UNAM, https://temoa.iib.unam.mx/ along with a lot of other texts (see pages beginning with CF "Códice Florentino"), e.g. https://temoa.iib.unam.mx/cf_05_v
ftyers commented on Codex documenting Aztec culture now digitized   hyperallergic.com/855683/... · Posted by u/diodorus
asimpletune · 2 years ago
Isn't it also the "García Garagarza" Spanish-English translation? Or is that what you mean by "some extra stuff"?
ftyers · 2 years ago
Yeah, that's what I mean, i.e. it's mostly existing published stuff. The new stuff is some partial summarisation in Eastern Huasteca Nahuatl, and some spoken audio (by EHN speakers), although it's unclear what the audio gains. Without training, it's not really intelligible to most speakers of modern varieties.
ftyers commented on Codex documenting Aztec culture now digitized   hyperallergic.com/855683/... · Posted by u/diodorus
tastyfreeze · 2 years ago
It makes me very happy to see ancient texts digitized for the world to see.

If only the Spaniards hadn't burned every codex they could get their hands on. There are 12 surviving pre-contact Aztec codexes. The Aztecs produced 480,000 sheets of paper annually. That amounts to heaps of records and knowledge that were destroyed.

ftyers · 2 years ago
True, but it's also interesting to see what survived. My wife is a Nahuatl speaker, and some of the stuff in the book on Omens is still part of the culture, e.g. about owls being a sign of death.
ftyers commented on Codex documenting Aztec culture now digitized   hyperallergic.com/855683/... · Posted by u/diodorus
ftyers · 2 years ago
Here is an example of what we're producing so far for book 5 "the omens": https://github.com/ftyers/UD_Classical_Nahuatl-FloCo/blob/ma...

The repo is kind of messy, it's mostly me and some students working on it, but we're pretty passionate. Let me know if you'd like to get involved! :)

ftyers · 2 years ago
Note that the annotation is alignable with images of the original manuscript, which are online at the Library of Congress. I.e. https://www.loc.gov/item/2021667850/

The [orig] contains all the original token and line breaks.

u/ftyers

Karma: 499 · Cake day: August 13, 2019