Get word break type #2755
Comments
Good question; I think this is needed for ECMA-402 compatibility, too. @makotokato @aethanyc ?
Does ECMA-402 define word break type? At least, we may have to add that to WordBreakSegmenter's options.
Yeah, it looks like in ECMA-402 we landed on a single coarse-grained isWordLike boolean: https://tc39.es/ecma402/#sec-createsegmentdataobject @ccleve Is the isWordLike boolean enough for your use case?
@sffc My use case is a search engine. We're tokenizing terms in text so we can search for them. We need to handle many languages, and I'd like to offer UAX 29 as an option for word breaking. (We're also doing character normalization in various forms.)

.isWordLike() isn't quite enough because we need special handling for certain punctuation marks like hyphens and apostrophes. For example, "load-bearing" should be tokenized as "load", "bearing", and "loadbearing". I need to know that the hyphen is a hyphen. "O'Brien" should be "o", "brien", and "obrien". In French, "l'espace" should just be "espace". Similarly, "ABC123" is a mixed letter-number term, and for certain apps (like part numbers) it needs to be split: "abc123", "abc", "123". (Although honestly, we probably would not need to use UAX 29 on part numbers.) If we tokenize queries the same way we tokenize text, then "+", "-", "(" and ")" are important.

All of this can be handled with a second pass over the tokens to classify each of them. But if we have to iterate over every character of every token to do this, it will have a performance impact. The more we can know about a token in advance, the better. In the hand-rolled lexers I've written in the past, I've returned token types WORD (letters only), WORD_WITH_NUMBERS, WORD_WITH_APOSTROPHE, HYPHENATED_WORD, PLUS, MINUS, etc. I know this goes beyond the scope of a Unicode word breaker, but I wanted to show the use case. Something like WORD (letters only), MIXED WORD (letters, numbers, apostrophes), WHITESPACE, and PUNCTUATION should do the trick.
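For illustration, here is a minimal Rust sketch of that kind of second pass, assuming the tokens have already been produced by a word segmenter. `TokenType` and `classify_token` are hypothetical names, not ICU4X APIs; the categories follow the ones suggested in the comment above.

```rust
// Hypothetical second-pass classifier; not an ICU4X API. It has to look at
// every character of every token, which is the overhead described above.
#[derive(Debug, PartialEq)]
enum TokenType {
    Word,           // letters only
    MixedWord,      // letters plus digits and/or apostrophes
    HyphenatedWord, // contains a hyphen, e.g. "load-bearing"
    Whitespace,
    Punctuation,
}

fn classify_token(token: &str) -> TokenType {
    let (mut letters, mut digits, mut apostrophes, mut hyphens) = (0, 0, 0, 0);
    for c in token.chars() {
        if c.is_alphabetic() {
            letters += 1;
        } else if c.is_numeric() {
            digits += 1;
        } else if c == '\'' || c == '\u{2019}' {
            apostrophes += 1;
        } else if c == '-' {
            hyphens += 1;
        }
    }
    // Digit-only tokens (e.g. "123") would need their own branch in a real classifier.
    if token.chars().all(char::is_whitespace) {
        TokenType::Whitespace
    } else if hyphens > 0 && letters > 0 {
        TokenType::HyphenatedWord
    } else if letters > 0 && (digits > 0 || apostrophes > 0) {
        TokenType::MixedWord
    } else if letters > 0 {
        TokenType::Word
    } else {
        TokenType::Punctuation
    }
}

fn main() {
    assert_eq!(classify_token("load-bearing"), TokenType::HyphenatedWord);
    assert_eq!(classify_token("O'Brien"), TokenType::MixedWord);
    assert_eq!(classify_token("ABC123"), TokenType::MixedWord);
    assert_eq!(classify_token("fox"), TokenType::Word);
    assert_eq!(classify_token("("), TokenType::Punctuation);
}
```

Note that even this simplified classifier visits every character of every token, which is the cost the comment wants to avoid by getting the break type from the segmenter directly.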
Here is the mapping from ICU4C states to isWordLike in V8:
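The referenced mapping itself isn't quoted here. For orientation only, the constants below follow the UWordBreak rule-status ranges documented in ICU4C's ubrk.h, and the is_word_like cutoff is an assumption about how a binding might collapse them, not the actual V8 code.

```rust
// Rule-status ranges documented for ICU4C's UWordBreak (ubrk.h). The
// is_word_like mapping is an assumed simplification, not the V8 source.
const UBRK_WORD_NONE: i32 = 0;     // spaces, punctuation, symbols
const UBRK_WORD_NUMBER: i32 = 100; // numeric "words"
const UBRK_WORD_LETTER: i32 = 200; // letters, excluding kana and ideographs
const UBRK_WORD_KANA: i32 = 300;   // kana
const UBRK_WORD_IDEO: i32 = 400;   // ideographic characters
const UBRK_WORD_IDEO_LIMIT: i32 = 500;

fn is_word_like(rule_status: i32) -> bool {
    // Treat anything at or above the NUMBER range as word-like.
    (UBRK_WORD_NUMBER..UBRK_WORD_IDEO_LIMIT).contains(&rule_status)
}

fn main() {
    assert!(is_word_like(UBRK_WORD_LETTER));
    assert!(!is_word_like(UBRK_WORD_NONE));
}
```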
From a discussion with @markusicu: The distinction between different types of "words" seems specific to a particular use case/search engine; and the distinction between punctuation and symbols can be a little fuzzy. In some cases, like "don't", punctuation becomes part of the word. This use case could be performed by checking the character property of the first 1-2 characters in the segment. We could explore making three categories: word-like, punctuation-like, and whitespace, but going further than that is tricky.
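A minimal sketch of that coarse three-way split, assuming only the first character of each segment is inspected; `SegmentKind` and `classify` are hypothetical names, not ICU4X APIs.

```rust
// Hypothetical coarse classification based on the segment's first character.
#[derive(Debug, PartialEq)]
enum SegmentKind {
    WordLike,
    PunctuationLike,
    Whitespace,
}

fn classify(segment: &str) -> SegmentKind {
    match segment.chars().next() {
        Some(c) if c.is_whitespace() => SegmentKind::Whitespace,
        Some(c) if c.is_alphanumeric() => SegmentKind::WordLike,
        _ => SegmentKind::PunctuationLike,
    }
}

fn main() {
    // "don't" starts with a letter, so the embedded apostrophe does not
    // demote it to punctuation.
    assert_eq!(classify("don't"), SegmentKind::WordLike);
    assert_eq!(classify(","), SegmentKind::PunctuationLike);
    assert_eq!(classify(" "), SegmentKind::Whitespace);
}
```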
@makotokato @aethanyc Is this needed for Mozilla's Intl.Segmenter? If so, we should schedule it for a nearby milestone.
Yes, we need to expose the word break type in order to implement Intl.Segmenter's isWordLike.
I'm going to schedule this for ICU4X 1.2, which we discussed would be the Segmenter 1.0 release, so it should include this change. Changes for ICU4X 1.2 should land around March 2023. Could you take the issue, @aethanyc or @makotokato?
When you iterate through text using the WordBreakIterator, you get the boundaries of words, spaces, punctuation, etc. It does not appear to tell you what kind of token or break it has found.
The C-language version of ICU has a function on the iterator called getRuleStatus() that returns an enum that describes the last break it found. The documentation is here: https://unicode-org.github.io/icu/userguide/boundaryanalysis/
Is there a similar function on WordBreakSegmenter / WordBreakIteratorUtf8 that I have missed?
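For concreteness, here is a hypothetical sketch of the kind of per-segment information being asked about. `WordType` and `word_type()` are illustrative names modeled on ICU4C's rule statuses, not confirmed ICU4X items.

```rust
// Hypothetical shape of the requested functionality; illustrative names only,
// not confirmed ICU4X API.
#[derive(Debug, Clone, Copy, PartialEq)]
enum WordType {
    None,   // spaces, punctuation, symbols
    Number, // numeric segments
    Letter, // alphabetic and similar segments
}

fn main() {
    // Desired usage (pseudocode in comments): alongside each boundary, the
    // iterator would report what kind of segment just ended, the way
    // getRuleStatus() does in ICU4C:
    //
    //   let mut iter = segmenter.segment_str("The quick (brown) fox");
    //   while let Some(_boundary) = iter.next() {
    //       let kind = iter.word_type(); // -> WordType
    //       // ...
    //   }
    let _ = (WordType::None, WordType::Number, WordType::Letter);
}
```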