Readit News
rmccrear commented on Edge AI for Beginners   github.com/microsoft/edge... · Posted by u/bakigul
rmccrear · 5 months ago
I clicked hoping the models would be available in the “Edge” browser.
rmccrear commented on Show HN: Mandarin Word Segmenter with Translation   mandobot.netlify.app/... · Posted by u/routerl
routerl · a year ago
Thanks for the kind words!

I'm using Jieba[0] because it strikes a nice balance between speed and accuracy. But I'm initializing it with a custom dictionary (~800k entries), and I've added several layers of post-segmentation heuristics. For example, Jieba tends to split a chengyu into two words, but I've decided each one should be displayed as a single word, since a chengyu is typically a single entry in dictionaries.

[0] https://github.com/fxsjy/jieba
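One way to picture the chengyu heuristic described above is a merge pass over the segmenter's output. This is only a minimal sketch under assumptions: the function name mergeIdioms, the idioms set, and the maxParts limit are hypothetical and are not taken from the project or from Jieba's API. It simply rejoins adjacent tokens whenever their concatenation appears in a known idiom list.

```typescript
// Hypothetical post-segmentation pass: rejoin adjacent tokens whose
// concatenation is a known idiom (for example, a chengyu the segmenter split).
// `idioms` and `maxParts` are illustrative assumptions, not project settings.
function mergeIdioms(
  tokens: string[],
  idioms: ReadonlySet<string>,
  maxParts = 3,
): string[] {
  const merged: string[] = [];
  let i = 0;
  while (i < tokens.length) {
    let matched = false;
    // Try the widest window first so longer idioms win over shorter ones.
    for (let span = Math.min(maxParts, tokens.length - i); span >= 2; span--) {
      const candidate = tokens.slice(i, i + span).join("");
      if (idioms.has(candidate)) {
        merged.push(candidate);
        i += span;
        matched = true;
        break;
      }
    }
    if (!matched) {
      // No idiom starts here; keep the token unchanged.
      merged.push(tokens[i]);
      i += 1;
    }
  }
  return merged;
}

// Usage: a segmenter split "一步登天" into two words; the pass rejoins them.
const exampleTokens = ["一步", "登天", ":", "走", "一步"];
const exampleIdioms = new Set(["一步登天"]);
console.log(mergeIdioms(exampleTokens, exampleIdioms));
```

A set lookup keeps each check cheap even with a large idiom list, and trying the widest window first avoids stopping at a shorter match when a longer idiom is present.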

rmccrear · a year ago
Great project! It's fascinating how hard segmentation is and how many approaches there are. I thought I'd mention a trick that lets you segment without a backend. When you double-click Chinese text in the browser, it highlights an entire word. For example, try double-clicking the text here: 一步登天:走一步就到天堂美好境地。 It highlights/segments the first 4 characters as a chengyu, and the rest as one- or two-character words. I haven't been able to discover what method Apple and Microsoft use to segment, but it seems to do a good job. You can even do this programmatically from JavaScript with the DOM's Range.expand() method (non-standard, so browser support varies). I once even made a little JS library that can run in the background and segment words on a page.
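For anyone who wants to try backend-free segmentation outside the DOM, here is a minimal sketch that swaps in a different API from the one named above: the standard Intl.Segmenter with word granularity, rather than Range.expand(). It assumes a runtime that ships Intl.Segmenter with dictionary data for Chinese (recent browsers, or Node.js built with full ICU). The function name segmentWords is hypothetical, and the exact word boundaries depend on the runtime's data, so they may not match what double-clicking selects.

```typescript
// Sketch: split text into word-level segments with the built-in
// Intl.Segmenter. Boundaries come from the runtime's own dictionary data,
// so results can differ between browsers and Node.js versions.
function segmentWords(text: string, locale = "zh"): string[] {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  // Each item exposes the matched substring as `segment`.
  return Array.from(segmenter.segment(text), (item) => item.segment);
}

const sampleText = "一步登天:走一步就到天堂美好境地。";
console.log(segmentWords(sampleText));
```

The segments always cover the whole input in order, punctuation included, so joining them gives back the original string. Each item also carries an isWordLike flag that could be used to drop punctuation.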

u/rmccrear

Karma: 2 · Cake day: February 8, 2025