Something we have had to deal with in managing educational software with a writing component is that what is offensive to whom, in what context, and where is not universal at all.
One of the prime examples: at one point a number of terms related to homosexuality had made it onto the list at the request of a larger district. These are also terms that are being reclaimed, and it was... a difficult problem to try to satisfy everyone, and it did upset other districts. I believe their patterns were all but removed eventually.
We have fought over the list of definitions, and every change provoked controversy. Our current solution is just that we mark items for teacher review but don't tell them why. We don't say they are offensive, we don't say what the problematic words are. We just say it might need review. That's worked pretty well so far.
All this is to say, policing speech is a problem best avoided.
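For anyone curious what "flag for review, but don't say why" could look like in code, here is a minimal sketch. It is not the actual system described above: the pattern list, `ReviewResult`, and `check_submission` are all hypothetical names made up for illustration, and the placeholder patterns are deliberately mild.

```python
import re
from dataclasses import dataclass

# Hypothetical watch list (assumption): a real deployment would load
# patterns configured per district rather than hard-coding them.
WATCH_PATTERNS = [r"\bdarn\w*", r"\bheck\b"]


@dataclass
class ReviewResult:
    # The only thing surfaced to the teacher: no matched word, no reason.
    needs_review: bool


def check_submission(text: str, patterns=WATCH_PATTERNS) -> ReviewResult:
    """Return a bare 'might need review' flag without exposing what matched."""
    hit = any(re.search(p, text, flags=re.IGNORECASE) for p in patterns)
    return ReviewResult(needs_review=hit)
```

The point of returning only a boolean is that the judgment about whether something is actually a problem stays with the teacher, who has the context the pattern list lacks.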
Unfortunately, whether or not a term is really offensive is a combination of what it is, who said it, and when/where (at least in the common-sense definition). Unfortunately, because this is directly opposed to our (at least in the US, and in most countries rooted in liberalism) sense of fairness which says that rules should be applicable universally, across all people and in all contexts.
Which is to say… policing speech is a problem best avoided!
> Unfortunately, because this is directly opposed to our (at least in the US, and in most countries rooted in liberalism) sense of fairness which says that rules should be applicable universally, across all people and in all contexts.
The US is not even internally consistent about this - the legal definition of obscenity in the US is deferential to local community standards.
I worked in a completely different field and had to give up on flagging any variations of "shit". Turns out working-class boomers will utter some form or another in every other sentence. Nothing harmful, just things like "my brother is full of horse shit" or "my job is bullshit".
“Shit” is pretty good because it is crass but not offensive (in the sense that it doesn’t target any particular group). And of course it describes a lot of what’s happening nowadays.
You just know it's a losing battle because euphemisms will get increasingly creative. So the more you try to stop it, the more people will spread sheet.
It's certainly an interesting data set, though it has no concept of severity. As far as I can tell, "doodoo" is the same as some racial slurs: we're 100% certain they're bad words.
I legit thought this said "... rating of success" meaning how likely the project was to be successful on some metric based on the profane words therein. I recall there was a study(?) akin to that for the Linux kernel, as a frame of reference
I think the value add here is being a software package. The lists exist elsewhere and the package authors supplied sources. If you really need a combined list it should be trivial to generate it from the code.
I think in this case volume wins out in that over 90% of Portuguese speakers are Brazilian Portuguese speakers. If anything it may one day just become "Portuguese" and "European Portuguese".
At that time, we will have niche dialects "American English" and "British English". "English" will be identified with the variety spoken in India. Please kindly do the needful good sir.
types something in live chat
some random word from the sentence gets censored out
"Why did this just get censored out?"
checks Urban Dictionary
"Why?????"
Bonus points if it's regular ethnonyms that are classified as profanities, so people from that place have big trouble saying where they are from.
https://arxiv.org/search/?query=fuck&searchtype=all&source=h...
though somebody did slip in a use in a comment earlier.
Baby shark, shitshitshit.
Baby shark!
https://www.usatoday.com/story/news/nation/2020/10/06/oklaho...
The word "European" was chosen to avoid the clash of "Portuguese Portuguese" ("português português") as opposed to Brazilian Portuguese.