Assuming Blink is using the same technique for text selection as V8 is for the public Intl.v8BreakIterator method, that's how Chrome's handling this-- Intl.v8BreakIterator is a pretty thin wrapper around the ICU BreakIterator implementation: https://chromium.googlesource.com/v8/v8/+/refs/heads/master/...
and then according to the LICENSE file[1], the dictionary:
# The word list in cjdict.txt are generated by combining three word lists
# listed below with further processing for compound word breaking. The
# frequency is generated with an iterative training against Google web
# corpora.
#
# * Libtabe (Chinese)
# - https://sourceforge.net/project/?group_id=1519
# - Its license terms and conditions are shown below.
#
# * IPADIC (Japanese)
# - http://chasen.aist-nara.ac.jp/chasen/distribution.html
# - Its license terms and conditions are shown below.
#
It's interesting to see some of the other techniques used in that engine, such as a special function to figure out the weights of potential katakana word splits.
[1] https://github.com/unicode-org/icu/blob/6417a3b720d8ae3643f7...
According to your first link, BreakIterator uses a dictionary for several languages, including Japanese. So I guess the full answer is something like:
Chrome uses v8's Intl.v8BreakIterator which uses icu::BreakIterator, which, for Japanese text, uses a big long list of Japanese words to try to figure out what is a word and what isn't. I've worked on a similar segmenter for Chinese and yeah, quality isn't great but it works in enough cases to be useful.
https://pingtype.github.io
Firefox, in contrast, breaks at script boundaries, so it'll select runs of Hiragana, Katakana, and Kanji. Not nearly as useful, and definitely makes copying Japanese text especially annoying.
https://bugzilla.mozilla.org/show_bug.cgi?id=820261
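As an illustration of that behavior (this is not Firefox's actual code, just a sketch of the idea, and scriptOf / splitByScript are names invented here): split the text into runs whose Unicode Script property stays the same.
function scriptOf(ch) {
  // Classify one code point by Unicode Script using regex property escapes.
  if (/\p{Script=Hiragana}/u.test(ch)) return 'Hiragana'
  if (/\p{Script=Katakana}/u.test(ch)) return 'Katakana'
  if (/\p{Script=Han}/u.test(ch)) return 'Han'
  if (/\p{Script=Latin}/u.test(ch)) return 'Latin'
  return 'Other'
}
function splitByScript(text) {
  var runs = []
  for (var ch of text) {   // iterates by code point, not UTF-16 code unit
    var script = scriptOf(ch)
    var last = runs[runs.length - 1]
    if (last && last.script === script) last.text += ch
    else runs.push({script: script, text: ch})
  }
  return runs.map(function (r) { return r.text })
}
console.log(splitByScript("その機能拡張は、"))
// ["その", "機能拡張", "は", "、"]; each double-click selection would be one of these runs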
Personally I find double click highlighting to be a useless feature in any language, but the Firefox approach is superior imo. Breaking at script boundaries is predictable behavior the user can anticipate whereas doing some janky ad hoc natural language processing invariably results in behavior that is essentially random from a user perspective. I tried out double click highlighting on some basic Japanese sentences on Chromium and it failed to highlight any of what would be considered words.
It's not like English highlighting does complex grammatical analysis to make sure "Project Manager" gets highlighted as one chunk and "eventUpdate" gets highlighted as two chunks; most implementations just break at spaces like the user expects.
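That space-based behavior is simple enough to sketch in a few lines. This is purely illustrative, not any browser's real selection code, and selectWordAt is a made-up helper: expand outward from the clicked offset until whitespace or the ends of the string.
function selectWordAt(text, index) {
  var start = index, end = index
  while (start > 0 && !/\s/.test(text[start - 1])) start--   // walk left to whitespace
  while (end < text.length && !/\s/.test(text[end])) end++   // walk right to whitespace
  return text.substring(start, end)
}
console.log(selectWordAt("the Project Manager updated eventUpdate", 6))
// "Project"; "eventUpdate" would likewise come back as a single chunk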
I use double-click highlighting, and the reason is mostly selecting passages of text when editing. Double-click highlighting makes it so I don't have to find the precise character boundaries for the first and last words in the passage. Instead, I can just double click the first word, roughly move my mouse to hover over the last, and copy or delete that entire passage.
Firefox's approach is fairly useless in this regard. Even if it's predictable from a technical perspective, it's not predictable for a reader who naturally processes semantic breaks rather than technical ones. Unlike in English, where a space is both semantic and visual, hiragana-kanji boundaries often don't mean anything. As a result, for me at least, Firefox's breaks feel a lot more random than Chrome's, which, while dodgy, are often fine.
Having used Firefox as my main browser since 2006, I remember discovering this feature in Chrome and being shocked at how much of an effect that minor improvement had for me. It's not a deal-breaker, certainly, but it's become my one big annoyance with Firefox.
I'm Japanese and I agree that Firefox behavior makes sense.
For example, take text with kanji like "その機能拡張は、". The word "機能拡張" is made up of the two words "機能" + "拡張".
In Chrome, a double-click selects an individual part (like 機能), which is rarely the behavior I want.
In Firefox, the whole word (機能拡張) is selected, which is what I want most of the time.
I think all of this just highlights (hah) that the way we think of human-language strings needs to change. They're not a stream of characters, they're their own thing with complex semantic rules. They should be represented as an opaque blob only manipulated via an API or passed to an OS for rendering etc.
Machine-readable strings can still just be an ASCII byte array, but we need to keep the two separate.
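As a sketch of what "only manipulated via an API" could look like, here is a hypothetical wrapper that never hands out raw code units and delegates segmentation to the standardized Intl.Segmenter. The HumanText name and its methods are invented for illustration; the actual splits come from whatever ICU data the engine ships.
class HumanText {
  #text
  constructor(text, locale) {
    this.#text = text
    this.locale = locale
  }
  words() {
    // Locale-aware word segmentation; the engine's ICU data does the real work.
    var seg = new Intl.Segmenter(this.locale, {granularity: 'word'})
    return Array.from(seg.segment(this.#text))
      .filter(function (s) { return s.isWordLike })
      .map(function (s) { return s.segment })
  }
  graphemes() {
    // User-perceived characters rather than UTF-16 code units.
    var seg = new Intl.Segmenter(this.locale, {granularity: 'grapheme'})
    return Array.from(seg.segment(this.#text)).map(function (s) { return s.segment })
  }
}
console.log(new HumanText("その機能拡張は、", "ja").words())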
I feel like this is a conclusion you could only reach by having an irrational compulsion to defend the deficiencies of Firefox, and not being a regular user of the Japanese language.
I actually like this behavior more, since it is predictable. Sometimes it just works for occasionally looking up proper nouns, and you can already tell when it won't.
I think it depends on how engrained the double-click highlight is for you. For me, I double-click by default, since I almost always want to select at a word boundary in English. As a result, when I need to select Chinese or Japanese text, I'm always annoyed when my double click (which, in my mind, should always select a word) selects a nonsensical sub-sentence instead, and I have to then re-select it manually.
Side note, this already doesn't work if you don't have a Touchpad / Magic mouse. The normal workflow is Right Click ==> Look Up, but Firefox overrides macOS's normal right click menu. :(
Firefox is just not a very good macOS citizen, sadly.
I actually work with Japanese tokenization a lot - I took over maintenance of the most popular Python MeCab wrapper last year, and I have another Cython-based wrapper that I maintain.
Word boundary awareness for Japanese is a pretty uncommon feature in applications, so I was surprised to see the feature had been in Chrome all along, even if it's buried and the quality has issues.
Anyway, thanks to everyone who tracked down the ICU implementation and the relevant part of Chrome!
This is often determined by Unicode and not the browsers specifically (though some browsers could override the suggested Unicode approach).
Each Unicode character has certain properties, one of which is whether that character indicates a break before / after itself (there's a rough sketch of reading these properties straight from the UCD after this comment).
I've done extensive research on this for my job, but unfortunately don't have time to do the whole writeup here. Here are several resources for those who are interested.
Info on break opportunities:
https://unicode.org/reports/tr14/#BreakOpportunities
The entire Unicode Character Database (~80MB XML file last I checked):
https://unicode.org/reports/tr44/
The properties within the UCD are hard to parse; here are references if you're interested:
https://unicode.org/reports/tr14/#Table1
https://www.unicode.org/Public/5.2.0/ucd/PropertyAliases.txt
https://www.unicode.org/Public/5.2.0/ucd/PropertyValueAliase...
Overall, word / line breaking in Unicode for no-space languages is a very difficult problem. Where the UCD says there can be a line break isn't necessarily where a native speaker would put one. To do it correctly you have to bring in natural language processing, but that has its own set of complexities.
In summary: I18N is hard!
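Here is that rough sketch of pulling the Line_Break property out of the UCD directly, since JavaScript regexes don't expose it. The file is the standard LineBreak.txt that backs the TR14 tables linked above; the parsing here is illustrative, not production code.
async function loadLineBreakClasses() {
  // LineBreak.txt lines look like "0030..0039;NU # ..." or "0028;OP # ..."
  var url = "https://www.unicode.org/Public/UCD/latest/ucd/LineBreak.txt"
  var text = await (await fetch(url)).text()
  var ranges = []
  for (var line of text.split("\n")) {
    var data = line.split("#")[0].trim()   // drop comments and blank lines
    if (!data) continue
    var parts = data.split(";")
    var range = parts[0].trim().split("..")
    var lo = parseInt(range[0], 16)
    var hi = parseInt(range[1] || range[0], 16)
    ranges.push([lo, hi, parts[1].trim()])
  }
  return function (codePoint) {
    for (var r of ranges) {
      if (codePoint >= r[0] && codePoint <= r[1]) return r[2]
    }
    return "XX"   // unlisted code points default to XX (unknown) per UAX #14
  }
}
// Usage:
//   var lineBreakClass = await loadLineBreakClasses()
//   lineBreakClass("あ".codePointAt(0))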
The library that Chrome uses seems to use a dictionary[0], since you can't determine word boundaries in Japanese just by looking at two characters.
Your first link also says:
> To handle certain situations, some line breaking implementations use techniques that cannot be expressed within the framework of the Unicode Line Breaking Algorithm. Examples include using dictionaries of words for languages that do not use spaces
[0] posted in another top-level comment: http://userguide.icu-project.org/boundaryanalysis
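To make "dictionaries of words" concrete, here is a toy greedy longest-match segmenter. The dictionary contents are made up, and ICU's real CJK break engine weighs candidates by dictionary frequency and handles compounds, so this is a deliberate oversimplification of the idea.
var dict = new Set(["その", "機能", "拡張", "機能拡張", "は"])   // tiny stand-in dictionary

function greedySegment(text, maxLen) {
  maxLen = maxLen || 6
  var words = []
  var i = 0
  while (i < text.length) {
    var match = text[i]   // fall back to a single character if nothing matches
    for (var len = Math.min(maxLen, text.length - i); len > 1; len--) {
      var candidate = text.substring(i, i + len)
      if (dict.has(candidate)) { match = candidate; break }
    }
    words.push(match)
    i += match.length
  }
  return words
}
console.log(greedySegment("その機能拡張は、"))
// ["その", "機能拡張", "は", "、"] with this dictionary; a real segmenter also needs
// frequencies to decide between 機能拡張 and 機能 + 拡張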
Yeah, CJK (Chinese, Japanese, Korean) breaking is particularly complex. Google has done a lot of work here and has this open-source implementation which uses NLP. It's the best I've personally come across:
https://github.com/google/budou
Yes. This seems to work even when you pass it Chinese while maintaining ja-JP as the language
function tokenizeJA(text) {
  // Intl.v8BreakIterator is non-standard and only exists in V8-based browsers.
  var it = Intl.v8BreakIterator(['ja-JP'], {type: 'word'})
  it.adoptText(text)
  var words = []
  var cur = 0, prev = 0
  while (cur < text.length) {
    prev = cur
    cur = it.next()   // next() returns the position of the next break
    words.push(text.substring(prev, cur))
  }
  return words
}
console.log(tokenizeJA("今天要去哪裡?"))
It still seems to parse just fine, so it's most likely just using the passed input to parse.
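For what it's worth, the same experiment can be written against the standardized Intl.Segmenter instead of the non-standard Intl.v8BreakIterator; the splits still come from the engine's ICU data, so results will vary by browser version (tokenizeJA2 is just a name picked for this sketch).
function tokenizeJA2(text) {
  var seg = new Intl.Segmenter('ja-JP', {granularity: 'word'})
  return Array.from(seg.segment(text)).map(function (s) { return s.segment })
}
console.log(tokenizeJA2("今天要去哪裡?"))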
Yep, the browser has the UCD info built into it (a simplification... but basically). Similarly, our mobile devices and various backend languages have the same data baked into them.
This is where there are sometimes discrepancies between how a given browser or device would output this data, as it could be working off of an outdated version of Unicode's data.
Some devices even override the default Unicode behavior. There are just SO many languages and SO many regions and SO many combinations thereof that even Unicode can't cover all the bases. It's all very fascinating from an engineering perspective.
It turns out that's because ICU uses a combined Chinese/Japanese dictionary instead of separate dictionaries for each language. Which probably is a little more robust if you misdetect some Chinese text as Japanese and vice-versa.
> The quality is not amazing but I'm surprised this is supported at all.
I find this line hilarious for some reason. Reminds me of the line about being a tourist in France, "French people don't expect you to speak French, but they appreciate it when you try"
Being a resident of France I have to say that French people are the opposite of that. Same for Japanese people. The only people who get genuinely excited as you butcher their language are Arabic speakers I've noticed.
As a tourist in Paris about 10 years ago, I felt I had better interactions when I butchered the language and then switched to English after the person I was speaking with did, than when I just started off in English.
I only knew a handful of phrases though, so anything off script and I was pretty lost.
Yeah, I think the intent of the phrase is to convey the condescending attitude you might encounter, but expressed it in a subtle way. Maybe too subtle!
That's funny. Not at all the way I read it, but it could totally be read that way. Made me do a double take.
Still, I think my original reading is correct, because I don't think there is any issue with the "quality" of v8 inside of jsfiddle. Meanwhile, imagining Chrome doing its best to identify real words in long strings of Japanese text and failing spectacularly just made me laugh again.
Because I was a bit surprised about that and made me wonder if opening this JSFiddle on Safari would work at all (I'm on a phone so I can't test).
TypeError: Intl.v8BreakIterator is not a function. (In 'Intl.v8BreakIterator(['ja-JP'], {type:'word'})', 'Intl.v8BreakIterator' is undefined)
OP here, I was not referring to it working in JSFiddle, just the idea of Chrome shipping with a Japanese tokenizer. I've known people to do stuff like compile MeCab with emscripten to get Japanese tokenization in Electron, so the fact that the functionality was there already (even with lower quality) was surprising.
[0]: http://www.solutions.asia/2016/10/japanese-tokenization.html...
I've been doing some work parsing Vietnamese text, which has the opposite problem. Compound words (which make up most of the vocabulary) are broken up into their components by spaces, indistinguishable from the boundaries between words.
Yes, that's how it is written in Vietnamese. To oversimplify: Vietnamese words are a collection of single syllables that are always separated by a space when writing.
"Viet Nam" is also, actually, the "official" English way to write it. (Check how the UN puts it on all their stuff.) However, most Europeans don't do that in their languages, so it usually gets written as Vietnam even by Vietnamese when they're writing European languages.
Given this property of Japanese text, is there wordplay associated with a string of characters with double/reverse meanings depending on how the characters are combined?
Not exactly double / reverse meanings, but there are several sentences that are traditionally used to test Japanese tokenizers. Most of them are tongue twisters, like this one:
すももももももももの内
"Japanese plums (sumomo) and peaches are both kinds of peaches"
In speech, intonation would make the word boundaries clear, but in writing it looks odd.
This isn't a tongue twister, but tokenizers often fail on it:
外国人参政権は難しい問題だ
"Voting rights for foreigners is a complex problem."
外国人 (foreigner) / 参政権 (gov participation rights) is the right tokenization, but a common error is to parse it as 外国 (foreign) / 人参 (carrot) / 政権 (political power).
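If you want to see what your own browser does with these test sentences, you can feed them to the tokenizeJA() snippet from earlier in the thread; the splits depend on the ICU version and dictionary the engine ships, so no particular output is guaranteed.
var testSentences = [
  "すももももももももの内",
  "外国人参政権は難しい問題だ"
]
for (var s of testSentences) {
  // Requires a browser where Intl.v8BreakIterator exists (see the snippet above).
  console.log(s, tokenizeJA(s))
}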
Yes. One that comes to mind is "[name]さんじゅうななさい" which can be either interpreted as "[name]-san, 17 years old" or "[name], 37 years old", depending on whether you interpret the さん(san) as an honorific or part of a number. (The sentence would usually be written in a combination of hiragana and kanji, but is intentionally written in all hiragana here to ensure ambiguity.)
Another one: "この先生きのこるには", which should be broken up at "この先/生きのこるには" to mean "To keep surviving going forward", but since 先生 (teacher) is such a common word, it jumps at your eyes and turns the sentence into "この先生/きのこるには" which means the nonsense "For this teacher to mushroom(verb)". Usually this doesn't happen because the "survive" part is written with more kanji as 生き残る, but here it is written in hiragana to make the きのこ(mushroom) part visible and further mess with lexing.
In both cases, some liberties have been taken with the notation to intentionally encourage silly misreadings; it happens much less often in ordinary text.