Get word break type #2755
Comments
Good question; I think this is needed for ECMA-402 compatibility, too. @makotokato @aethanyc ?
Does ECMA-402 define word break type? At least, we may have to add that to WordBreakSegmenter's options.
Yeah, it looks like in ECMA-402 we landed on a single coarse-grained isWordLike boolean: https://tc39.es/ecma402/#sec-createsegmentdataobject @ccleve Is the isWordLike boolean enough for your use case?
@sffc My use case is a search engine. We're tokenizing terms in text so we can search for them. We need to handle many languages, and I'd like to offer UAX 29 as an option for word breaking. (We're also doing character normalization in various forms.)

.isWordLike() isn't quite enough because we need special handling for certain punctuation marks like hyphens and apostrophes. For example, "load-bearing" should be tokenized as "load", "bearing", and "loadbearing". I need to know that the hyphen is a hyphen. "O'Brien" should be "o", "brien", and "obrien". In French, "l'espace" should just be "espace". Similarly, "ABC123" is a mixed letter-number term, and for certain apps (like part numbers) it needs to be split: "abc123", "abc", "123". (Although honestly, we probably would not need to use UAX 29 on part numbers.) If we tokenize queries the same way we tokenize text, then "+", "-", "(" and ")" are important.

All of this can be handled with a second pass over the tokens to classify each of them. But if we have to iterate over every character of every token to do this, it will have a performance impact. The more we can know about a token in advance, the better. In the hand-rolled lexers I've written in the past, I've returned token types WORD (letters only), WORD_WITH_NUMBERS, WORD_WITH_APOSTROPHE, HYPHENATED_WORD, PLUS, MINUS, etc. I know this goes beyond the scope of a Unicode word breaker, but I wanted to show the use case. Something like WORD (letters only), MIXED WORD (letters, numbers, apostrophes), WHITESPACE, and PUNCTUATION should do the trick.
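For illustration, here is a minimal Rust sketch of that kind of second pass, assuming the tokens have already been produced by a word segmenter. `TokenType` and `classify_token` are hypothetical names, not ICU4X APIs; the categories follow the ones suggested in the comment above.

```rust
// Hypothetical second-pass classifier; not an ICU4X API. It has to look at
// every character of every token, which is the overhead described above.
#[derive(Debug, PartialEq)]
enum TokenType {
    Word,           // letters only
    MixedWord,      // letters plus digits and/or apostrophes
    HyphenatedWord, // contains a hyphen, e.g. "load-bearing"
    Whitespace,
    Punctuation,
}

fn classify_token(token: &str) -> TokenType {
    let (mut letters, mut digits, mut apostrophes, mut hyphens) = (0, 0, 0, 0);
    for c in token.chars() {
        if c.is_alphabetic() {
            letters += 1;
        } else if c.is_numeric() {
            digits += 1;
        } else if c == '\'' || c == '\u{2019}' {
            apostrophes += 1;
        } else if c == '-' {
            hyphens += 1;
        }
    }
    // Digit-only tokens (e.g. "123") would need their own branch in a real classifier.
    if token.chars().all(char::is_whitespace) {
        TokenType::Whitespace
    } else if hyphens > 0 && letters > 0 {
        TokenType::HyphenatedWord
    } else if letters > 0 && (digits > 0 || apostrophes > 0) {
        TokenType::MixedWord
    } else if letters > 0 {
        TokenType::Word
    } else {
        TokenType::Punctuation
    }
}

fn main() {
    assert_eq!(classify_token("load-bearing"), TokenType::HyphenatedWord);
    assert_eq!(classify_token("O'Brien"), TokenType::MixedWord);
    assert_eq!(classify_token("ABC123"), TokenType::MixedWord);
    assert_eq!(classify_token("fox"), TokenType::Word);
    assert_eq!(classify_token("("), TokenType::Punctuation);
}
```

Note that even this simplified classifier visits every character of every token, which is the cost the comment wants to avoid by getting the break type from the segmenter directly.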
Here is the mapping from ICU4C states to isWordLike in V8:
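The referenced mapping itself isn't quoted here. For orientation only, the constants below follow the UWordBreak rule-status ranges documented in ICU4C's ubrk.h, and the is_word_like cutoff is an assumption about how a binding might collapse them, not the actual V8 code.

```rust
// Rule-status ranges documented for ICU4C's UWordBreak (ubrk.h). The
// is_word_like mapping is an assumed simplification, not the V8 source.
const UBRK_WORD_NONE: i32 = 0;     // spaces, punctuation, symbols
const UBRK_WORD_NUMBER: i32 = 100; // numeric "words"
const UBRK_WORD_LETTER: i32 = 200; // letters, excluding kana and ideographs
const UBRK_WORD_KANA: i32 = 300;   // kana
const UBRK_WORD_IDEO: i32 = 400;   // ideographic characters
const UBRK_WORD_IDEO_LIMIT: i32 = 500;

fn is_word_like(rule_status: i32) -> bool {
    // Treat anything at or above the NUMBER range as word-like.
    (UBRK_WORD_NUMBER..UBRK_WORD_IDEO_LIMIT).contains(&rule_status)
}

fn main() {
    assert!(is_word_like(UBRK_WORD_LETTER));
    assert!(!is_word_like(UBRK_WORD_NONE));
}
```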
From a discussion with @markusicu: The distinction between different types of "words" seems specific to a particular use case/search engine; and the distinction between punctuation and symbols can be a little fuzzy. In some cases, like "don't", punctuation becomes part of the word. This use case could be performed by checking the character property of the first 1-2 characters in the segment. We could explore making three categories: word-like, punctuation-like, and whitespace, but going further than that is tricky.
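A minimal sketch of that coarse three-way split, assuming only the first character of each segment is inspected; `SegmentKind` and `classify` are hypothetical names, not ICU4X APIs.

```rust
// Hypothetical coarse classification based on the segment's first character.
#[derive(Debug, PartialEq)]
enum SegmentKind {
    WordLike,
    PunctuationLike,
    Whitespace,
}

fn classify(segment: &str) -> SegmentKind {
    match segment.chars().next() {
        Some(c) if c.is_whitespace() => SegmentKind::Whitespace,
        Some(c) if c.is_alphanumeric() => SegmentKind::WordLike,
        _ => SegmentKind::PunctuationLike,
    }
}

fn main() {
    // "don't" starts with a letter, so the embedded apostrophe does not
    // demote it to punctuation.
    assert_eq!(classify("don't"), SegmentKind::WordLike);
    assert_eq!(classify(","), SegmentKind::PunctuationLike);
    assert_eq!(classify(" "), SegmentKind::Whitespace);
}
```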
@makotokato @aethanyc Is this needed for Mozilla's Intl.Segmenter? If so, we should schedule it for a nearby milestone.
Yes, we need to expose the word break type in order to implement Intl.Segmenter's isWordLike.
I'm going to schedule this for ICU4X 1.2, which we discussed would be the Segmenter 1.0 release, so it should include this change. Changes for ICU4X 1.2 should land around March 2023. Could you take the issue, @aethanyc or @makotokato?
When you iterate through text using the WordBreakIterator, you get the boundaries of words, spaces, punctuation, etc. It does not appear to tell you what kind of token or break it has found.
The C-language version of ICU has a function on the iterator called getRuleStatus() that returns an enum that describes the last break it found. The documentation is here: https://unicode-org.github.io/icu/userguide/boundaryanalysis/
Is there a similar function on WordBreakSegmenter / WordBreakIteratorUtf8 that I have missed?
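For concreteness, here is a hypothetical sketch of the kind of per-segment information being asked about. `WordType` and `word_type()` are illustrative names modeled on ICU4C's rule statuses, not confirmed ICU4X items.

```rust
// Hypothetical shape of the requested functionality; illustrative names only,
// not confirmed ICU4X API.
#[derive(Debug, Clone, Copy, PartialEq)]
enum WordType {
    None,   // spaces, punctuation, symbols
    Number, // numeric segments
    Letter, // alphabetic and similar segments
}

fn main() {
    // Desired usage (pseudocode in comments): alongside each boundary, the
    // iterator would report what kind of segment just ended, the way
    // getRuleStatus() does in ICU4C:
    //
    //   let mut iter = segmenter.segment_str("The quick (brown) fox");
    //   while let Some(_boundary) = iter.next() {
    //       let kind = iter.word_type(); // -> WordType
    //       // ...
    //   }
    let _ = (WordType::None, WordType::Number, WordType::Letter);
}
```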