dc.contributor.author |
Bochkarev V.V. |
|
dc.contributor.author |
Khristoforov S.V. |
|
dc.contributor.author |
Shevlyakova A.V. |
|
dc.date.accessioned |
2021-02-25T06:51:04Z |
|
dc.date.available |
2021-02-25T06:51:04Z |
|
dc.date.issued |
2020 |
|
dc.identifier.issn |
0302-9743 |
|
dc.identifier.uri |
https://dspace.kpfu.ru/xmlui/handle/net/161085 |
|
dc.description.abstract |
© 2020, Springer Nature Switzerland AG. This paper describes how to build a recognizer that identifies named entities occurring in the Google Books Ngram corpus. In previous studies, named entity recognition was usually performed by feeding running text to the recognizer. In this paper, the decision is instead based on an analysis of word co-occurrence statistics. The recognizer is a neural network whose input is a vector of frequencies of the bigrams or syntactic bigrams that include the word under study. The task is to recognize named entities denoted by a single word; however, the proposed method can be extended to recognize two-word or multi-word named entities. The recognition error probability obtained on a test sample of 10 thousand words free from homonymy was 2.71% (F1-score 0.963). Solving the word classification problem for Google Books Ngram will make it possible to create large dictionaries of named entities, which will in turn improve the quality of named entity recognition in texts by existing algorithms. |
|
dc.relation.ispartofseries |
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
|
dc.subject |
Google Books Ngram |
|
dc.subject |
N-grams frequencies |
|
dc.subject |
Named entities recognition |
|
dc.subject |
Neural networks |
|
dc.subject |
Syntactic bigrams |
|
dc.title |
Recognition of Named Entities in the Russian Subcorpus Google Books Ngram |
|
dc.type |
Conference Paper |
|
dc.relation.ispartofseries-volume |
12469 LNAI |
|
dc.collection |
Publications of KFU staff |
|
dc.relation.startpage |
17 |
|
dc.source.id |
SCOPUS03029743-2020-12469-SID85092929783 |
|