Get word break type #2755

Closed
ccleve opened this issue Oct 17, 2022 · 9 comments · Fixed by #3139
Labels
C-segmentation Component: Segmentation question Unresolved questions; type unclear T-core Type: Required functionality U-ecma402 User: ECMA-402 compatibility

Comments

@ccleve

ccleve commented Oct 17, 2022

When you iterate through text using the WordBreakIterator, you get the boundaries of words, spaces, punctuation, etc. It does not appear to tell you what kind of token or break it has found.

The C-language version of ICU has a function on the iterator called getRuleStatus() that returns an enum that describes the last break it found. The documentation is here:

https://unicode-org.github.io/icu/userguide/boundaryanalysis/

The function getRuleStatus() returns an enum giving additional information on the text preceding the last break position found. Using this value, it is possible to distinguish between numbers, words, words containing kana characters, words containing ideographic characters, and non-word characters, such as spaces or punctuation.
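For illustration, here is a minimal sketch of how a caller might bucket a raw rule-status value into the categories described above. It assumes the numeric ranges of ICU4C's `UWordBreak` enum (`UBRK_WORD_NONE` = 0, `UBRK_WORD_NUMBER` = 100, `UBRK_WORD_LETTER` = 200, `UBRK_WORD_KANA` = 300, `UBRK_WORD_IDEO` = 400, each range 100 wide). The `WordType` enum and `classify_rule_status` function are hypothetical names, not part of ICU4C or ICU4X.

```rust
/// Hypothetical coarse categories, mirroring the rule-status ranges that
/// ICU4C documents for its word break iterator.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum WordType {
    /// Spaces, punctuation, and other non-word text (UBRK_WORD_NONE range).
    None,
    /// Numbers (UBRK_WORD_NUMBER range).
    Number,
    /// Words containing letters (UBRK_WORD_LETTER range).
    Letter,
    /// Words containing kana characters (UBRK_WORD_KANA range).
    Kana,
    /// Words containing ideographic characters (UBRK_WORD_IDEO range).
    Ideo,
    /// Any value outside the ranges above (an assumption of this sketch).
    Unknown,
}

/// Maps a raw rule-status integer to a coarse category.
pub fn classify_rule_status(status: i32) -> WordType {
    match status {
        0..=99 => WordType::None,
        100..=199 => WordType::Number,
        200..=299 => WordType::Letter,
        300..=399 => WordType::Kana,
        400..=499 => WordType::Ideo,
        _ => WordType::Unknown,
    }
}
```

With this, `classify_rule_status(200)` yields `WordType::Letter`, and any status below 100 yields `WordType::None`.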

Is there a similar function on WordBreakSegmenter / WordBreakIteratorUtf8 that I have missed?

@sffc
Member

sffc commented Oct 17, 2022

Good question; I think this is needed for 402 compatibility, too. @makotokato @aethanyc ?

@makotokato
Member

Does ECMA-402 define a word break type? At the least, we may need to expose that through WordBreakSegmenter's options (isWordLike?).

@sffc
Member

sffc commented Oct 18, 2022

Yeah, it looks like in ECMA-402 we landed on a single coarse-grained isWordLike option.

https://tc39.es/ecma402/#sec-createsegmentdataobject

@ccleve Is the isWordLike boolean sufficient for your use case, or do you need the more expressive version in ICU4C? If you need the latter, could you share more about your use case and why you need the increased granularity?

@ccleve
Author

ccleve commented Oct 18, 2022

@sffc My use case is a search engine. We're tokenizing terms in text so we can search for them. We need to handle many languages, and I'd like to offer UAX 29 as an option for word breaking. (We're also doing character normalization in various forms.)

.isWordLike() isn't quite enough because we need special handling for certain punctuation marks like hyphens and apostrophes. For example, "load-bearing" should be tokenized as "load", "bearing", and "loadbearing". I need to know that the hyphen is a hyphen. "O'Brien" should be "o", "brien", and "obrien". In French, "l'espace" should just be "espace".
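The hyphen and apostrophe expansion described above could look roughly like the sketch below. It is only an illustration: `expand_compound` is a hypothetical helper, the set of joiner characters is an assumption, and the language-specific French elision rule ("l'espace" to "espace") is deliberately not handled.

```rust
/// Hypothetical helper: lowercases a token, splits it on hyphens and
/// apostrophes, and returns the parts followed by the joined form.
/// A token with no joiner is returned as a single lowercased part.
pub fn expand_compound(token: &str) -> Vec<String> {
    // Assumed joiner set: ASCII hyphen and apostrophe, plus the Unicode
    // hyphen (U+2010) and right single quotation mark (U+2019).
    let is_joiner = |c: char| matches!(c, '-' | '\'' | '\u{2010}' | '\u{2019}');

    let lower = token.to_lowercase();
    let mut out: Vec<String> = lower
        .split(is_joiner)
        .filter(|part| !part.is_empty())
        .map(str::to_string)
        .collect();

    // Only add the joined form when the token was actually split.
    if out.len() > 1 {
        let joined = out.concat();
        out.push(joined);
    }
    out
}
```

For example, `expand_compound("load-bearing")` returns `["load", "bearing", "loadbearing"]`, and `expand_compound("O'Brien")` returns `["o", "brien", "obrien"]`.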

Similarly, "ABC123" is a mixed letter-number term, and for certain apps (like part numbers) it needs to be split: "abc123", "abc", "123". (Although honestly, we probably would not need to use UAX 29 on part numbers.)
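As a sketch of the part-number case, the hypothetical function below splits a token into runs of digits and non-digits and returns the whole lowercased term first, followed by each run. Treating every non-numeric character as part of a "letter" run is a simplifying assumption.

```rust
/// Hypothetical helper: returns the lowercased token followed by its
/// digit and non-digit runs. A token with a single run is returned alone.
pub fn split_letter_digit_runs(token: &str) -> Vec<String> {
    let lower = token.to_lowercase();
    let mut runs: Vec<String> = Vec::new();
    let mut prev_is_digit: Option<bool> = None;

    for c in lower.chars() {
        let is_digit = c.is_numeric();
        if prev_is_digit == Some(is_digit) {
            // Same kind as the previous character: extend the current run.
            if let Some(last) = runs.last_mut() {
                last.push(c);
            }
        } else {
            // Kind changed (or first character): start a new run.
            runs.push(c.to_string());
        }
        prev_is_digit = Some(is_digit);
    }

    if runs.len() > 1 {
        // Mixed term: emit the full term, then each run.
        let mut out = vec![lower];
        out.extend(runs);
        out
    } else {
        runs
    }
}
```

For example, `split_letter_digit_runs("ABC123")` returns `["abc123", "abc", "123"]`.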

If we tokenize queries the same way we tokenize text, then "+", "-", "(" and ")" are important.

All of this can be handled with a second pass over the tokens to classify each of them. But if we have to iterate over every character of every token to do this, it will have a performance impact. The more we can know about a token in advance the better.

In the hand-rolled lexers I've written in the past, I've returned token types such as WORD (letters only), WORD_WITH_NUMBERS, WORD_WITH_APOSTROPHE, HYPHENATED_WORD, PLUS, MINUS, etc. I know this goes beyond the scope of a Unicode word breaker, but I wanted to show the use case.

Something like WORD (letters only), MIXED WORD (letters, numbers, apostrophes), WHITESPACE, and PUNCTUATION should do the trick.
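To make the second-pass cost concrete, here is a minimal sketch of classifying a token into those four categories by scanning its characters. `TokenKind` and `classify_token` are hypothetical names; the sketch uses only Rust's standard `char` predicates, so anything that is neither alphanumeric nor whitespace (including symbols such as "+") falls into the punctuation bucket.

```rust
/// Hypothetical token categories for the second-pass classifier.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum TokenKind {
    /// Letters only.
    Word,
    /// Contains letters or digits plus something else (digits, apostrophes).
    MixedWord,
    /// Whitespace only.
    Whitespace,
    /// Everything else (punctuation and symbols, by assumption).
    Punctuation,
}

/// Classifies a token by scanning its characters.
/// Returns `None` for an empty token.
pub fn classify_token(token: &str) -> Option<TokenKind> {
    if token.is_empty() {
        return None;
    }
    let kind = if token.chars().all(char::is_alphabetic) {
        TokenKind::Word
    } else if token.chars().any(char::is_alphanumeric) {
        TokenKind::MixedWord
    } else if token.chars().all(char::is_whitespace) {
        TokenKind::Whitespace
    } else {
        TokenKind::Punctuation
    };
    Some(kind)
}
```

Each branch may walk the whole token, which is the per-character overhead a break-type API on the segmenter would avoid.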

@sffc
Member

sffc commented Oct 19, 2022

Here is the mapping from ICU4C states to isWordLike in V8:

https://github.com/v8/v8/blob/fd3a2291f94e6e8b1a156ac68e1aa7301b41d858/src/objects/js-segments.cc#L93
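As a rough sketch of that kind of mapping (the linked V8 source is the authority, not this snippet): a segment is treated as word-like when its rule status falls in any of the number, letter, kana, or ideographic ranges, and not word-like when it falls in the "none" range. The constant values below are assumed to match ICU4C's `UWordBreak` enum, and `is_word_like` is a hypothetical name.

```rust
// Range boundaries assumed to match ICU4C's UWordBreak enum.
// The number, letter, kana, and ideographic ranges are contiguous,
// so a single half-open interval covers all of them.
const UBRK_WORD_NUMBER: i32 = 100; // first non-"none" status
const UBRK_WORD_IDEO_LIMIT: i32 = 500; // one past the last ideographic status

/// Returns true when the rule status denotes a number, letter, kana,
/// or ideographic word; false for the "none" range and unknown values.
pub fn is_word_like(rule_status: i32) -> bool {
    (UBRK_WORD_NUMBER..UBRK_WORD_IDEO_LIMIT).contains(&rule_status)
}
```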

@sffc
Member

sffc commented Oct 19, 2022

From a discussion with @markusicu: The distinction between different types of "words" seems specific to a particular use case/search engine; and the distinction between punctuation and symbols can be a little fuzzy. In some cases, like "don't", punctuation becomes part of the word. This use case could be performed by checking the character property of the first 1-2 characters in the segment.

We could explore making 3 categories: word-like, punctuation-like, and whitespace, but going further than that is tricky.
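A minimal sketch of the first-character approach with those three categories might look like this. `SegmentClass` and `classify_by_first_char` are hypothetical names, and the standard `char` predicates stand in for real Unicode general-category lookups, so symbols land in the punctuation-like bucket.

```rust
/// Hypothetical three-way classification of a segment.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum SegmentClass {
    WordLike,
    PunctuationLike,
    Whitespace,
}

/// Classifies a segment by inspecting only its first character.
/// Returns `None` for an empty segment.
pub fn classify_by_first_char(segment: &str) -> Option<SegmentClass> {
    let first = segment.chars().next()?;
    let class = if first.is_alphanumeric() {
        SegmentClass::WordLike
    } else if first.is_whitespace() {
        SegmentClass::Whitespace
    } else {
        // Assumption: anything else counts as punctuation-like.
        SegmentClass::PunctuationLike
    };
    Some(class)
}
```

This relies on the segmenter keeping word-internal punctuation inside the segment, so "don't" is classified as word-like from its first letter alone.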

@sffc sffc added U-ecma402 User: ECMA-402 compatibility question Unresolved questions; type unclear T-core Type: Required functionality C-segmentation Component: Segmentation labels Dec 1, 2022
@sffc
Member

sffc commented Dec 1, 2022

@makotokato @aethanyc Is this needed for Mozilla Intl.Segmenter? If so we should schedule it in a near milestone.

@aethanyc
Contributor

aethanyc commented Dec 4, 2022

> @makotokato @aethanyc Is this needed for Mozilla Intl.Segmenter? If so we should schedule it in a near milestone.

Yes, we need to expose the word break type in order to implement Intl.Segmenter's isWordLike property. Firefox is not in a rush to do the JS engine work as part of the initial integration, though. We are focusing on integrating the line and word segmenters into the layout engine.

@sffc sffc added this to the ICU4X 1.2 milestone Dec 5, 2022
@sffc
Member

sffc commented Dec 5, 2022

I'm going to schedule this for ICU4X 1.2, which we discussed would be the Segmenter 1.0 release and should include this change. Changes for ICU4X 1.2 should be in around March 2023. Could you take the issue, @aethanyc or @makotokato?
