Abstract:
© Springer Nature Switzerland AG 2020. This paper describes a method for automatically recognizing the part of speech and other grammatical categories of a word, such as gender and number. Unlike some previous works, we use the vector of frequencies of syntactic bigrams that include the word in question as the source data for recognizing parts of speech and grammatical categories. Data on the frequencies of syntactic bigrams were obtained from the Russian sub-corpus of Google Books Ngram. We used the part-of-speech tags available in Google Books Ngram, as well as data on the parts of speech and grammatical categories of words obtained from the electronic dictionary Open Corpora. To train the model, we selected words from the list of the 100,000 most frequent words that have no homonyms and are found in both Google Books Ngram and Open Corpora. A multilayer perceptron with a softmax output layer was used as the recognizer. The network input was the vector of frequencies of syntactic bigrams that include the test word and one of the 10,000 most frequent words. The neural network was trained by minimizing cross-entropy. On the test sample, the average accuracy of part-of-speech recognition was 99.1%. Nouns and verbs were recognized best, with accuracies of 99.77% and 99.62%, respectively. The recognition accuracy for number was 99.61%. The recognition accuracy for gender was substantially lower, at 91.9%.
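
The recognizer described above (a multilayer perceptron with a softmax output layer, trained by minimizing cross-entropy on bigram frequency vectors) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The data are synthetic stand-ins for bigram counts, with 50 context words in place of 10,000 and 3 classes. The single ReLU hidden layer of 16 units, the feature standardization, the learning rate, and all function names (`make_toy_data`, `run_demo`, and so on) are assumptions made for illustration only.

```python
import numpy as np


def softmax(z):
    # Subtract the row-wise maximum for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def cross_entropy(probs, labels):
    # Mean negative log-probability assigned to the correct class.
    rows = np.arange(labels.shape[0])
    return float(-np.log(probs[rows, labels] + 1e-12).mean())


def init_mlp(n_in, n_hidden, n_out, rng):
    # He-style initialization; the layer sizes are illustrative assumptions.
    return {
        "W1": rng.normal(0.0, np.sqrt(2.0 / n_in), (n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, np.sqrt(2.0 / n_hidden), (n_hidden, n_out)),
        "b2": np.zeros(n_out),
    }


def forward(params, x):
    # One ReLU hidden layer (assumed), followed by a softmax output layer.
    h = np.maximum(0.0, x @ params["W1"] + params["b1"])
    return h, softmax(h @ params["W2"] + params["b2"])


def train_step(params, x, labels, lr):
    # One full-batch gradient-descent step on the mean cross-entropy.
    n = x.shape[0]
    h, probs = forward(params, x)
    loss = cross_entropy(probs, labels)
    d_logits = probs.copy()
    d_logits[np.arange(n), labels] -= 1.0  # softmax + cross-entropy gradient
    d_logits /= n
    d_h = (d_logits @ params["W2"].T) * (h > 0.0)  # backprop through ReLU
    params["W2"] -= lr * (h.T @ d_logits)
    params["b2"] -= lr * d_logits.sum(axis=0)
    params["W1"] -= lr * (x.T @ d_h)
    params["b1"] -= lr * d_h.sum(axis=0)
    return loss  # loss measured before the update


def make_toy_data(n_words, n_context, n_classes, rng):
    # Synthetic stand-in for bigram counts: each class has its own profile
    # over context words, and each word's counts are drawn from that profile.
    profiles = rng.dirichlet(np.ones(n_context), size=n_classes)
    labels = rng.integers(0, n_classes, size=n_words)
    counts = np.stack([rng.multinomial(500, profiles[c]) for c in labels])
    freqs = counts / counts.sum(axis=1, keepdims=True)  # relative frequencies
    return freqs, labels


def run_demo(seed=0, n_context=50, n_classes=3, epochs=500, lr=0.01):
    rng = np.random.default_rng(seed)
    x, y = make_toy_data(400, n_context, n_classes, rng)
    x_train, y_train, x_test, y_test = x[:300], y[:300], x[300:], y[300:]
    # Standardize with training statistics (an assumed preprocessing step).
    mean, std = x_train.mean(axis=0), x_train.std(axis=0) + 1e-8
    x_train, x_test = (x_train - mean) / std, (x_test - mean) / std
    params = init_mlp(n_context, 16, n_classes, rng)
    losses = [train_step(params, x_train, y_train, lr) for _ in range(epochs)]
    _, probs = forward(params, x_test)
    accuracy = float((probs.argmax(axis=1) == y_test).mean())
    return losses[0], losses[-1], accuracy


if __name__ == "__main__":
    first, last, acc = run_demo()
    print(f"loss {first:.3f} -> {last:.3f}, held-out accuracy {acc:.3f}")
```

In this sketch each row of the input matrix plays the role of one word's bigram frequency vector, and the class label stands for a part-of-speech tag. The same structure would apply to gender or number by changing the label set. The toy classes are far easier to separate than real corpus data, so the accuracy printed here says nothing about the figures reported in the paper.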