NotImplementedError: [E894] The 'noun_chunks' syntax iterator is not implemented for language 'ru'. #204
Thank you @gremur
Thank you @ceteri. If you'd like to repeat my test, I used the following code:
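The original snippet was not preserved in this thread; below is a minimal reconstruction of the kind of reproduction described, assuming a short hypothetical Russian sample text and the "ru_core_news_sm" model mentioned later in the thread.

```python
import spacy
import pytextrank  # noqa: F401 -- importing registers the "textrank" pipeline factory

# hypothetical sample text; the original Russian text was not preserved
text = "Москва является столицей Российской Федерации."

nlp = spacy.load("ru_core_news_sm")
nlp.add_pipe("textrank")

# NotImplementedError [E894] is raised here: the Russian models
# do not implement the noun_chunks syntax iterator
doc = nlp(text)

for phrase in doc._.phrases:
    print(phrase.text, phrase.rank)
```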
The error occurs at this line of code: 'doc = nlp(text)'
Thank you @gremur, that code snippet really helps us debug. This now extracts two entities and runs without exceptions, although I have a hunch that more structural work will be needed to get good results in cases where spaCy does not provide noun chunks.
I'm evaluating across the different algorithms we've implemented, using this script:
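The evaluation script itself is not shown in the thread; the following is a rough sketch of how such a comparison might look, assuming the four pipeline names that pytextrank registers ("textrank", "positionrank", "biasedtextrank", "topicrank") and a placeholder sample text.

```python
import spacy
import pytextrank  # noqa: F401 -- importing registers the pytextrank pipeline factories

# placeholder; the original test text from the issue was not preserved
text = "Москва является столицей Российской Федерации."

for name in ("textrank", "positionrank", "biasedtextrank", "topicrank"):
    nlp = spacy.load("ru_core_news_sm")
    nlp.add_pipe(name)
    try:
        doc = nlp(text)
        print(name, [phrase.text for phrase in doc._.phrases])
    except Exception as exc:
        print(name, "raised:", type(exc).__name__, exc)
```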
With the following results:
TopicRank raises an exception, though the other algorithms produce results, albeit limited ones. @tomaarsen, would it be possible for you to take a look? @gremur, overall would you expect many more entities to be extracted?
Perhaps I did not provide the best example text, but some 'noun chunks' can be found (marked below in bold italic).
The same problem happens in Chinese. Is there any progress?
hi @k0286, based on @gremur's feedback we made the need for noun chunks optional. so, yes, this specific request has been completed. even so, the results depend on the quality of the other pipeline components prior to the textgraph analysis. in the case of mandarin, as far as i'm aware we've never had any feedback yet about its use. if you've got a PR, we're ready to work on integration!
Thank you!
It seems to me that nlp.add_pipe("textrank") requires "noun chunks", which raises NotImplementedError for language models where noun chunks have not been implemented. I got NotImplementedError with the "ru_core_news_lg" and "ru_core_news_sm" spaCy models.
The proposal is to make the use of "noun chunks" optional to prevent such errors.
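As a sketch of what making noun chunks optional could look like (an illustration only, not the actual patch that landed), the iterator can be guarded so that languages without a noun_chunks implementation simply contribute no chunks:

```python
from spacy.tokens import Doc

def safe_noun_chunks(doc: Doc):
    """Yield noun chunks when the language implements them, otherwise nothing."""
    try:
        yield from doc.noun_chunks
    except NotImplementedError:
        # e.g. spaCy error [E894] for the Russian models
        return
```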