dc.description.abstract |
© 2016 IEEE. Several well-known corpus management systems exist (Sketch Engine, Manatee, EXMARaLDA, etc.). The system presented in this article offers comparable search functionality but also takes into account certain specifics of Turkic languages. The Tatar corpus management system (http://corpus.antat.ru) is designed specifically to work with Turkic linguistic corpora. Its functionality includes search for lexical units, morphological and lexical search, search for syntactic units, grammar-based n-gram search, and others. The core of the system is a semantic model of Tatar language data representation. Search is implemented with open-source tools (the MariaDB database management system and the Redis data store). The Tatar language has a complex agglutinative morphology, and we consider the system of grammatical categories represented in the grammatical annotation of the Tatar corpus to be a key to the semantics of the language. By selecting and combining grammatical, lexical, and other query parameters, we can obtain sets of semantic samples from semantically unstructured corpus data. The main task of our research is to detect and describe a class of grammatically conditioned semantic phenomena and to develop a system of corpus queries for extracting these phenomena. Experiments with queries to the Tatar corpus show that semantically relevant combinations of query parameters may differ in their level of complexity. The results may be used for document clustering and classification, for building a Tatar grammar, and for other purposes. |
|