Electronic Archive

Recognition of parts of speech using the vector of bigram frequencies

dc.contributor.author Khristoforov S.
dc.contributor.author Bochkarev V.
dc.contributor.author Shevlyakova A.
dc.date.accessioned 2021-02-25T06:54:12Z
dc.date.available 2021-02-25T06:54:12Z
dc.date.issued 2020
dc.identifier.issn 1865-0929
dc.identifier.uri https://dspace.kpfu.ru/xmlui/handle/net/161405
dc.description.abstract © Springer Nature Switzerland AG 2020. This paper describes how to automatically recognize the part of speech and other grammatical categories of a word, such as gender and number. Unlike some previous works, it uses the vector of syntactic bigram frequencies (bigrams that include the considered word) as the source data for recognizing parts of speech and grammatical categories. Data on the frequencies of syntactic bigrams were obtained from the Russian sub-corpus of Google Books Ngram. We used the part-of-speech tags available in Google Books Ngram, as well as data on parts of speech and grammatical categories of words obtained from the electronic dictionary Open Corpora. To train the model, we selected words from the list of the 100,000 most frequent words that have no homonyms and are found in both Google Books Ngram and Open Corpora. A multilayer perceptron with a softmax output layer was used as the recognizer. The network input was the vector of frequencies of syntactic bigrams that include the test word and one of the 10,000 most frequent words. The neural network was trained by minimizing cross-entropy. On the test sample, the average accuracy of part-of-speech recognition was 99.1%. Nouns and verbs were recognized best (with accuracies of 99.77% and 99.62%, respectively). The recognition accuracy for word number was 99.61%. The recognition accuracy for word gender was substantially lower, at 91.9%.
dc.relation.ispartofseries Communications in Computer and Information Science
dc.subject Bigram frequency
dc.subject Google Books Ngram
dc.subject Neural networks
dc.subject Part of speech recognition
dc.title Recognition of parts of speech using the vector of bigram frequencies
dc.type Conference Paper
dc.relation.ispartofseries-volume 1086
dc.collection KFU Staff Publications
dc.relation.startpage 1
dc.source.id SCOPUS18650929-2020-1086-SID85087545397
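
The pipeline summarized in the abstract (a bigram-frequency vector fed to a multilayer perceptron with a softmax output, trained on cross-entropy) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy English words and counts, the normalization of each vector to sum to 1, the single tanh hidden layer of 4 units, plain stochastic gradient descent, and all function and variable names. The paper uses 10,000 context words from the Russian sub-corpus of Google Books Ngram. This sketch uses four.

```python
import math
import random

def bigram_vector(word, bigram_counts, context_words):
    """Relative frequencies of the bigrams (word, c) for each context word c."""
    counts = [bigram_counts.get((word, c), 0) for c in context_words]
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]

def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(probs, target):
    return -math.log(max(probs[target], 1e-12))

class TinyMLP:
    """One tanh hidden layer and a softmax output (layer sizes are assumptions)."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]
        self.b2 = [0.0] * n_out

    def forward(self, x):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        logits = [sum(w * hi for w, hi in zip(row, h)) + b
                  for row, b in zip(self.w2, self.b2)]
        return h, softmax(logits)

    def train_step(self, x, target, lr=0.5):
        h, p = self.forward(x)
        # Gradient of cross-entropy w.r.t. the logits is (p - one_hot(target)).
        dz = [pi - (1.0 if k == target else 0.0) for k, pi in enumerate(p)]
        # Backpropagate to the hidden layer before the output weights change.
        dh = [sum(dz[k] * self.w2[k][j] for k in range(len(dz))) * (1 - h[j] ** 2)
              for j in range(len(h))]
        for k, g in enumerate(dz):
            for j in range(len(h)):
                self.w2[k][j] -= lr * g * h[j]
            self.b2[k] -= lr * g
        for j, g in enumerate(dh):
            for i in range(len(x)):
                self.w1[j][i] -= lr * g * x[i]
            self.b1[j] -= lr * g
        return cross_entropy(p, target)

# Toy data invented for illustration; not from Google Books Ngram.
CONTEXT = ["the", "red", "to", "will"]
COUNTS = {
    ("cat", "the"): 9, ("cat", "red"): 3,
    ("dog", "the"): 8, ("dog", "red"): 2,
    ("run", "to"): 7, ("run", "will"): 5,
    ("eat", "to"): 6, ("eat", "will"): 6,
    ("bird", "the"): 5, ("bird", "red"): 1,  # held out from training
}
TAGS = ["NOUN", "VERB"]
LABELS = {"cat": 0, "dog": 0, "run": 1, "eat": 1}

model = TinyMLP(len(CONTEXT), 4, len(TAGS))
train = [(bigram_vector(w, COUNTS, CONTEXT), y) for w, y in LABELS.items()]

def mean_loss():
    return sum(cross_entropy(model.forward(x)[1], y) for x, y in train) / len(train)

first_loss = mean_loss()
for _ in range(300):
    for x, y in train:
        model.train_step(x, y)
last_loss = mean_loss()

def predict(word):
    probs = model.forward(bigram_vector(word, COUNTS, CONTEXT))[1]
    return TAGS[probs.index(max(probs))]

print(predict("bird"), round(first_loss, 3), round(last_loss, 3))
```

The same structure would extend to gender or number recognition by swapping the tag set. The reported accuracies depend on the paper's actual data and network configuration, which this sketch does not reproduce.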


This item appears in the following collections

  • KFU Staff Publications Scopus [24551]
    The collection contains publications by staff of Kazan Federal University (Kazan State University until 2010) indexed in the Scopus database, starting from 1970.
