JonathonW · 6 years ago
ICU (International Components for Unicode) provides an API for this: http://userguide.icu-project.org/boundaryanalysis

Assuming Blink uses the same technique for text selection as V8 does for the public Intl.v8BreakIterator method, that's how Chrome handles this. Intl.v8BreakIterator is a pretty thin wrapper around the ICU BreakIterator implementation: https://chromium.googlesource.com/v8/v8/+/refs/heads/master/...
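
For illustration, here's a rough sketch of how a double-click handler might drive that iterator to find the word under a given character offset (wordAt is a hypothetical helper of mine, not Blink's actual code):

  // Hypothetical sketch, not Blink's implementation: walk the break
  // positions until the segment [start, end) contains the offset.
  function wordAt(text, offset) {
    var it = Intl.v8BreakIterator(['ja-JP'], {type: 'word'})
    it.adoptText(text)
    var start = it.first()  // the first boundary is always 0
    var end = it.next()
    while (end !== -1 && end <= offset) {
      start = end
      end = it.next()
    }
    return text.substring(start, end === -1 ? text.length : end)
  }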

chch · 6 years ago
Doing a bit more deep diving into the ICU code, it looks like the source code for the Break engine (used by Chinese, Japanese, and Korean) is here: https://github.com/unicode-org/icu/blob/778d0a6d1d46faa724ea...

and, according to the LICENSE file [1], the dictionary is assembled as follows:

   #  The word list in cjdict.txt are generated by combining three word lists
   # listed below with further processing for compound word breaking. The
   # frequency is generated with an iterative training against Google web
   # corpora.
   #
   #  * Libtabe (Chinese)
   #    - https://sourceforge.net/project/?group_id=1519
   #    - Its license terms and conditions are shown below.
   #
   #  * IPADIC (Japanese)
   #    - http://chasen.aist-nara.ac.jp/chasen/distribution.html
   #    - Its license terms and conditions are shown below.
   #

It's interesting to see some of the other techniques used in that engine, such as a special function to figure out the weights of potential katakana word splits.

[1] https://github.com/unicode-org/icu/blob/6417a3b720d8ae3643f7...

erjiang · 6 years ago
According to your first link, BreakIterator uses a dictionary for several languages, including Japanese. So I guess the full answer is something like:

Chrome uses v8's Intl.v8BreakIterator which uses icu::BreakIterator, which, for Japanese text, uses a big long list of Japanese words to try to figure out what is a word and what isn't. I've worked on a similar segmenter for Chinese and yeah, quality isn't great but it works in enough cases to be useful.
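
To build intuition for what that looks like, here's a toy greedy longest-match segmenter over a tiny made-up dictionary. (ICU's actual engine is more sophisticated: it scores candidate splits using the word frequencies mentioned in the LICENSE excerpt above, rather than matching greedily.)

  // Toy sketch: greedy longest-match against a tiny made-up dictionary.
  // ICU's real engine instead picks the best path through all candidate
  // splits using word frequencies, but the lookup idea is the same.
  var dict = new Set(['日本語', '日本', 'を', '話す'])
  var maxLen = 3  // length of the longest dictionary entry

  function segment(text) {
    var words = []
    var i = 0
    while (i < text.length) {
      var match = text[i]  // fall back to a single character
      for (var len = Math.min(maxLen, text.length - i); len > 1; len--) {
        var candidate = text.slice(i, i + len)
        if (dict.has(candidate)) { match = candidate; break }
      }
      words.push(match)
      i += match.length
    }
    return words
  }

  console.log(segment('日本語を話す'))  // -> ['日本語', 'を', '話す']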

peterburkimsher · 6 years ago
What did you use for your Chinese word segmentation? I wrote Pingtype myself to help me learn by splitting words and translating them in parallel.

https://pingtype.github.io

TheChaplain · 6 years ago
For Firefox, there's a bug discussing the pros and cons of using ICU for boundaries, so at least they're aware of the issue.

https://bugzilla.mozilla.org/show_bug.cgi?id=820261

wwarner · 6 years ago
Was looking for that last link myself -- thanks!
trnglina · 6 years ago
Firefox, in contrast, breaks at script boundaries, so it'll select runs of Hiragana, Katakana, and Kanji. Not nearly as useful, and definitely makes copying Japanese text especially annoying.
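
Something like the following approximates that behavior; this is my guess at it using Unicode property escapes, not Firefox's actual code:

  // Rough approximation of script-run selection: keep runs of the same
  // script together. A guess at the behavior, not Firefox's actual code.
  function splitByScript(text) {
    return text.match(
      /\p{Script=Hiragana}+|\p{Script=Katakana}+|\p{Script=Han}+|./gsu
    ) || []
  }

  console.log(splitByScript('ひらがなとカタカナと漢字'))
  // -> ['ひらがなと', 'カタカナ', 'と', '漢字']
  // The particle と sticks to the preceding hiragana run, which is
  // exactly why copying Japanese text this way is annoying.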
knolax · 6 years ago
Personally I find double click highlighting to be a useless feature in any language, but the Firefox approach is superior imo. Breaking at script boundaries is predictable behavior the user can anticipate whereas doing some janky ad hoc natural language processing invariably results in behavior that is essentially random from a user perspective. I tried out double click highlighting on some basic Japanese sentences on Chromium and it failed to highlight any of what would be considered words.

It's not like English highlighting does complex grammatical analysis to make sure "Project Manager" gets highlighted as one chunk and "eventUpdate" gets highlighted as two chunks; most implementations just break at spaces, like the user expects.

trnglina · 6 years ago
I use double-click highlighting, and the reason is mostly selecting passages of text when editing. Double-click highlighting makes it so I don't have to find the precise character boundaries for the first and last words in the passage. Instead, I can just double click the first word, roughly move my mouse to hover over the last, and copy or delete that entire passage.

Firefox's approach is fairly useless in this regard. Even if it's predictable from a technical perspective, it's not predictable for a reader who naturally processes semantic breaks rather than technical ones. Unlike in English, where a space is both semantic and visual, hiragana-kanji boundaries often don't mean anything. As a result, for me at least, Firefox's breaks feel a lot more random than Chrome's, which, while dodgy, are often fine.

Having used Firefox as my main browser since 2006, I remember discovering this feature in Chrome and being shocked at how much of an effect that minor improvement had for me. It's not a deal-breaker, certainly, but it's become my one big annoyance with Firefox.

fomine3 · 6 years ago
I'm Japanese and I agree that Firefox's behavior makes sense. For example, take a text with kanji like "その機能拡張は、", where the word "機能拡張" consists of the words "機能" + "拡張". In Chrome, double-click selects an individual part (like 機能), which is rarely the behavior I want. In Firefox, the whole word (機能拡張) is selected, which is usually what I want.
taneq · 6 years ago
I think all of this just highlights (hah) that the way we think of human-language strings needs to change. They're not a stream of characters, they're their own thing with complex semantic rules. They should be represented as an opaque blob only manipulated via an API or passed to an OS for rendering etc.

Machine-readable strings can still just be an ASCII byte array, but we need to keep the two separate.

microcolonel · 6 years ago
I feel like this is a conclusion you could only reach by having an irrational compulsion to defend the deficiencies of Firefox, and not being a regular user of the Japanese language.
jack1243star · 6 years ago
I actually like this behavior more, since it is predictable. Sometimes it just works for occasionally looking up proper nouns, and you can already tell when it won't.
trnglina · 6 years ago
I think it depends on how ingrained the double-click highlight is for you. I double-click by default, since I almost always want to select at a word boundary in English. As a result, when I need to select Chinese or Japanese text, I'm always annoyed when my double click (which, in my mind, should always select a word) selects a nonsensical sub-sentence instead, and I then have to re-select it manually.
zeroimpl · 6 years ago
It also prevents the dictionary lookup gesture of macOS from working in Firefox, since it selects the whole sentence and looks that up (which fails).
Wowfunhappy · 6 years ago
Side note: this already doesn't work if you don't have a trackpad / Magic Mouse. The normal workflow is Right Click ==> Look Up, but Firefox overrides macOS's normal right-click menu. :(

Firefox is just not a very good macOS citizen, sadly.

polm23 · 6 years ago
OP here, surprised to see this took off.

I actually work with Japanese tokenization a lot - I took over maintenance of the most popular Python MeCab wrapper last year, and I have another Cython-based wrapper that I maintain.

Word boundary awareness for Japanese is a pretty uncommon feature in applications, so I was surprised to see the feature had been in Chrome all along, even if it's buried and the quality has issues.

Anyway, thanks to everyone who tracked down the ICU implementation and the relevant part of Chrome!

LikeAnElephant · 6 years ago
This is often determined by Unicode and not the browsers specifically (though some browsers may override the suggested Unicode approach).

Each Unicode character has certain properties, one of which is whether that character indicates a break before / after itself.

I've done extensive research on this for my job, but unfortunately don't have time to do the whole writeup here. Here are several resources for those who are interested:

Info on break opportunities:

https://unicode.org/reports/tr14/#BreakOpportunities

The entire Unicode Character Database (an ~80 MB XML file, last I checked):

https://unicode.org/reports/tr44/

The properties within the UCD are hard to parse; here's a reference if you're interested:

https://unicode.org/reports/tr14/#Table1

https://www.unicode.org/Public/5.2.0/ucd/PropertyAliases.txt

https://www.unicode.org/Public/5.2.0/ucd/PropertyValueAliase...

Overall, word / line breaking in no-space languages is a very difficult problem for Unicode. Where the UCD says there can be a line break often isn't where a native speaker would put one. In order to do it correctly you have to bring in Natural Language Processing, but that has its own set of complexities.
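
As a toy illustration of the pair-rule idea (hard-coding a few break classes; a real implementation reads the full Line_Break property data and applies many more rules):

  // Toy sketch of UAX #14: map each character to a line-break class,
  // then let pair rules decide whether a break is allowed between
  // neighbors. Classes and rules here are drastically simplified.
  function breakClass(ch) {
    if (ch === ' ') return 'SP'               // space: break after
    if (ch === '-') return 'HY'               // hyphen: break after
    if ('、。」）'.includes(ch)) return 'CL'  // closing punctuation
    return 'AL'                               // ordinary letter
  }

  function breakOpportunities(text) {
    var positions = []
    for (var i = 1; i < text.length; i++) {
      var before = breakClass(text[i - 1])
      var after = breakClass(text[i])
      // Break after spaces and hyphens, but never before closing punctuation.
      if ((before === 'SP' || before === 'HY') && after !== 'CL') positions.push(i)
    }
    return positions
  }

  console.log(breakOpportunities('foo-bar baz'))  // -> [4, 8]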

In summary: I18N is hard!

erjiang · 6 years ago
The library that Chrome uses seems to use a dictionary [0], since you can't determine word boundaries in Japanese just by looking at two adjacent characters.

Your first link also says:

> To handle certain situations, some line breaking implementations use techniques that cannot be expressed within the framework of the Unicode Line Breaking Algorithm. Examples include using dictionaries of words for languages that do not use spaces

[0] posted in another top-level comment: http://userguide.icu-project.org/boundaryanalysis

LikeAnElephant · 6 years ago
Yeah, CJK (Chinese, Japanese, Korean) breaking is particularly complex. Google has done a lot of work here and has this open-source implementation, which uses NLP. It's the best I've personally come across:

https://github.com/google/budou

swang · 6 years ago
Yes. This seems to work even when you pass it Chinese while keeping ja-JP as the language:

  // Tokenize with Chrome's non-standard Intl.v8BreakIterator.
  function tokenizeJA(text) {
    var it = Intl.v8BreakIterator(['ja-JP'], {type: 'word'})
    it.adoptText(text)
    var words = []

    var cur = 0, prev = 0

    // Walk the break positions and collect each segment.
    while (cur < text.length) {
      prev = cur
      cur = it.next()
      words.push(text.substring(prev, cur))
    }

    return words
  }

  console.log(tokenizeJA("今天要去哪裡?"))
It still seems to parse just fine, so it's most likely just segmenting whatever input it's given, regardless of the locale.

yorwba · 6 years ago
The underlying library is actually using a single dictionary for both Chinese and Japanese https://github.com/unicode-org/icu/tree/7814980f51bca2000a96...
LikeAnElephant · 6 years ago
Yep, the browser has the UCD info built into it (a simplification... but basically). Similarly, our mobile devices and various backend languages have the same data baked into them.

This is why there are sometimes discrepancies in how a given browser or device outputs this data: it could be working off an outdated version of Unicode's data.

Some devices even override the default Unicode behavior. There are just SO many languages and SO many regions and SO many combinations thereof that even Unicode can't cover all the bases. It's all very fascinating from an engineering perspective.

erjiang · 6 years ago
It turns out that's because ICU uses a combined Chinese/Japanese dictionary instead of separate dictionaries for each language, which is probably a little more robust if you misdetect some Chinese text as Japanese or vice versa.
darkerside · 6 years ago
> The quality is not amazing but I'm surprised this is supported at all.

I find this line hilarious for some reason. Reminds me of the line about being a tourist in France, "French people don't expect you to speak French, but they appreciate it when you try"

curiousgal · 6 years ago
Being a resident of France, I have to say that French people are the opposite of that. Same for Japanese people. The only people I've noticed who get genuinely excited as you butcher their language are Arabic speakers.
toast0 · 6 years ago
As a tourist in Paris about 10 years ago, I felt I had better interactions when I butchered the language and then switched to English after the person I was speaking with did, than when I just started off in English.

I only knew a handful of phrases though, so anything off script and I was pretty lost.

mattigames · 6 years ago
I hate that. How are they supposed to improve their language skills if not by trying to speak it?
darkerside · 6 years ago
Yeah, I think the intent of the phrase is to convey the condescending attitude you might encounter, just expressed in a subtle way. Maybe too subtle!
superasn · 6 years ago
Same for Indian people. Everyone gets super excited when foreigners speak even one single sentence in Hindi.
whoisjuan · 6 years ago
Was OP referring to the fact that the V8 namespace is available inside JSFiddle?

Because I was a bit surprised by that, and it made me wonder whether opening this JSFiddle in Safari would work at all (I’m on a phone, so I can’t test).

darkerside · 6 years ago
That's funny. Not at all the way I read it, but it could totally be read that way. Made me do a double take.

Still, I think my original reading is correct, because I don't think there's any issue with the "quality" of V8 inside of JSFiddle. Meanwhile, imagining Chrome doing its best to identify real words in long strings of Japanese text and failing spectacularly just made me laugh again.

polm23 · 6 years ago
OP here, I was not referring to it working in JSFiddle, just the idea of Chrome shipping with a Japanese tokenizer. I've known people to do stuff like compile MeCab with Emscripten to get Japanese tokenization in Electron, so the fact that the functionality was there already (even with lower quality) was surprising.
yftsui · 6 years ago
The fiddle doesn't work on Safari.

TypeError: Intl.v8BreakIterator is not a function. (In 'Intl.v8BreakIterator(['ja-JP'], {type:'word'})', 'Intl.v8BreakIterator' is undefined)
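
Since it's a non-standard V8 extension, a feature check avoids the TypeError (reusing the tokenizeJA helper from swang's comment above):

  // Guard against browsers that don't ship the non-standard API.
  if (typeof Intl.v8BreakIterator === 'function') {
    console.log(tokenizeJA('今天要去哪裡?'))
  } else {
    console.log('Intl.v8BreakIterator is not available in this browser')
  }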

dlivingston · 6 years ago
Here’s a brief write-up [0] on techniques and software for Japanese tokenization.

[0]: http://www.solutions.asia/2016/10/japanese-tokenization.html...

simplicio · 6 years ago
I've been doing some work parsing Vietnamese text, which has the opposite problem: compound words (which make up most of the vocabulary) are broken up into their components by spaces, making those breaks indistinguishable from the boundaries between words.
enos · 6 years ago
Is that why the name of the country is sometimes spelled with a space, "Viet nam"?
freddie_mercury · 6 years ago
Yes, that's how it is written in Vietnamese. To oversimplify: Vietnamese words are a collection of single syllables that are always separated by a space when writing.

"Viet Nam" is also, actually, the "official" English way to write it. (Check how the UN puts it on all their stuff.) However, most Europeans don't do that in their languages, so it usually gets written as Vietnam even by Vietnamese when they're writing European languages.

oh_sigh · 6 years ago
Given this property of Japanese text, is there wordplay associated with a string of characters with double/reverse meanings depending on how the characters are combined?
polm23 · 6 years ago
Not exactly double / reverse meanings, but there are several sentences that are traditionally used to test Japanese tokenizers. Most of them are tongue twisters, like this one:

すももももももももの内

"Japanese plums (sumomo) and peaches are both kinds of peaches"

In speech, intonation would make the word boundaries here clear, but in writing it looks odd.

This isn't a tongue twister, but tokenizers often fail on it:

外国人参政権は難しい問題だ

"Voting rights for foreigners is a complex problem."

外国人 (foreigner) / 参政権 (gov participation rights) is the right tokenization, but a common error is to parse it as 外国 (foreign) / 人参 (carrot) / 政権 (political power).
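
If you want to check how Chrome handles these yourself, you can feed them to the tokenizeJA helper from swang's comment above; the exact splits depend on the ICU version the browser ships:

  console.log(tokenizeJA('すももももももももの内'))
  console.log(tokenizeJA('外国人参政権は難しい問題だ'))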

needle0 · 6 years ago
Yes. One that comes to mind is "[name]さんじゅうななさい", which can be interpreted either as "[name]-san, 17 years old" or "[name], 37 years old", depending on whether you interpret the さん (san) as an honorific or as part of a number. (The sentence would usually be written in a combination of hiragana and kanji, but is intentionally written in all hiragana here to ensure ambiguity.)
needle0 · 6 years ago
Another one: "この先生きのこるには", which should be broken up as "この先/生きのこるには" to mean "to keep surviving going forward". But since 先生 (teacher) is such a common word, it jumps out at you and turns the sentence into "この先生/きのこるには", which is the nonsensical "for this teacher to mushroom (as a verb)". Usually this doesn't happen, because the "survive" part is written with more kanji as 生き残る, but here it's written in hiragana to make the きのこ (mushroom) part visible and further mess with lexing.

In both cases, some liberties have been taken with notation to intentionally encourage silly misreadings; it happens much less often in ordinary text.