Recognition of parts of speech using the vector of bigram frequencies

Bochkarev V.; Shevlyakova A.; Khristoforov S.

dc.contributor.author	Khristoforov S.
dc.contributor.author	Bochkarev V.
dc.contributor.author	Shevlyakova A.
dc.date.accessioned	2021-02-25T06:54:12Z
dc.date.available	2021-02-25T06:54:12Z
dc.date.issued	2020
dc.identifier.issn	1865-0929
dc.identifier.uri	https://dspace.kpfu.ru/xmlui/handle/net/161405
dc.description.abstract	© Springer Nature Switzerland AG 2020. This paper describes how to automatically recognize the part of speech of a word and other grammatical categories such as gender and number. Unlike some previous works, we use the vector of syntactic bigram frequencies (for bigrams that include the word under consideration) as the source data for recognizing parts of speech and grammatical categories. Data on frequencies of syntactic bigrams were obtained from the Russian sub-corpus of Google Books Ngram. We used the part-of-speech tags available in Google Books Ngram, as well as data on parts of speech and grammatical categories of words obtained from the electronic dictionary Open Corpora. To train the model, we selected words from the list of the 100,000 most frequent words that have no homonyms and are found in both Google Books Ngram and Open Corpora. A multilayer perceptron with a softmax output layer was used as the recognizer. The network input was the vector of frequencies of syntactic bigrams that include the test word and one of the 10,000 most frequent words. The neural network was trained by minimizing cross-entropy. On the test sample, the average accuracy of part-of-speech recognition was 99.1%. Nouns and verbs were recognized best (with accuracies of 99.77% and 99.62%, respectively). The recognition accuracy for word number was 99.61%. The recognition accuracy for word gender was substantially lower, at just 91.9%.
dc.relation.ispartofseries	Communications in Computer and Information Science
dc.subject	Bigram frequency
dc.subject	Google Books Ngram
dc.subject	Neural networks
dc.subject	Part of speech recognition
dc.title	Recognition of parts of speech using the vector of bigram frequencies
dc.type	Conference Paper
dc.relation.ispartofseries-volume	1086
dc.collection	Publications of KFU staff
dc.relation.startpage	1
dc.source.id	SCOPUS18650929-2020-1086-SID85087545397
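
The recognizer described in the abstract (a multilayer perceptron with a softmax output layer, trained by minimizing cross-entropy on vectors of bigram frequencies) can be illustrated with a small sketch. This is not the authors' implementation. The data are synthetic stand-ins for bigram-frequency vectors, and the network size, tanh hidden layer, learning rate, column standardization, and all function names (`train_mlp`, `make_toy_data`, `forward`) are assumptions made only for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # shift logits for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward(params, X):
    W1, b1, W2, b2 = params
    H = np.tanh(X @ W1 + b1)              # hidden layer (tanh is an assumption)
    return H, softmax(H @ W2 + b2)        # class probabilities

def train_mlp(X, y, n_classes, hidden=16, lr=0.1, epochs=1000, seed=0):
    """Full-batch gradient descent on the mean cross-entropy loss."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    params = [rng.normal(0, 0.1, (d, hidden)), np.zeros(hidden),
              rng.normal(0, 0.1, (hidden, n_classes)), np.zeros(n_classes)]
    Y = np.eye(n_classes)[y]              # one-hot targets
    losses = []
    for _ in range(epochs):
        W1, b1, W2, b2 = params
        H, P = forward(params, X)
        losses.append(float(-np.mean(np.log(P[np.arange(n), y] + 1e-12))))
        dZ2 = (P - Y) / n                 # gradient of the loss w.r.t. output logits
        dZ1 = (dZ2 @ W2.T) * (1.0 - H ** 2)  # backpropagate through tanh
        grads = [X.T @ dZ1, dZ1.sum(axis=0), H.T @ dZ2, dZ2.sum(axis=0)]
        params = [p - lr * g for p, g in zip(params, grads)]
    return params, losses

def make_toy_data(n_classes=3, n_context=20, words_per_class=60, seed=1):
    """Synthetic data: each class has its own profile over context words."""
    rng = np.random.default_rng(seed)
    profiles = rng.dirichlet(np.ones(n_context), size=n_classes)
    X, y = [], []
    for c in range(n_classes):
        counts = rng.multinomial(200, profiles[c], size=words_per_class)
        X.append(counts / counts.sum(axis=1, keepdims=True))  # relative frequencies
        y += [c] * words_per_class
    X = np.vstack(X)
    # Column standardization over the whole toy set (a simplification).
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)
    return X, np.array(y)

X, y = make_toy_data()
order = np.random.default_rng(2).permutation(len(y))
train, test = order[:120], order[120:]
params, losses = train_mlp(X[train], y[train], n_classes=3)
pred = forward(params, X[test])[1].argmax(axis=1)
accuracy = float((pred == y[test]).mean())
print(f"loss {losses[0]:.3f} -> {losses[-1]:.3f}, held-out accuracy {accuracy:.2f}")
```

In the setting of the abstract, each input vector would have one component per frequent context word (10,000 of them) and the classes would be parts of speech or values of a grammatical category. The toy sizes here are chosen only so the sketch runs quickly.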

Files in this item

Name: SCOPUS18650929-20 ...

Size: 85.07Kb

Format: PDF


This item appears in the following collections

Publications of KFU staff Scopus [24551]
The collection contains publications by staff of Kazan Federal University (Kazan State University until 2010) indexed in the Scopus database, starting from 1970.
