Kazan Federal University Digital Repository

The history of corpus linguistics (on the example of the English language corpora)

dc.contributor.author Solnyshkina M.I.
dc.contributor.author Gatiyatullina G.M.
dc.date.accessioned 2021-02-25T20:52:40Z
dc.date.available 2021-02-25T20:52:40Z
dc.date.issued 2020
dc.identifier.issn 1998-6645
dc.identifier.uri https://dspace.kpfu.ru/xmlui/handle/net/162555
dc.description.abstract © 2020 Tomsk State University. All rights reserved. The aim of the research is to review the milestones in the development of corpus linguistics and to present an original classification of the main periods in the formation and development of English-language corpora. The classification comprises four periods: (a) the “pre-electronic” period, or the period of text archives, which lasted several centuries and ended in the 1960s; (b) the “first generation” period, from the 1960s to the mid-1990s; (c) the “second generation” period of megacorpora, corresponding to the last decade of the 20th century; (d) the “third generation” period of gigacorpora, which started in the mid-2000s. Pre-electronic corpora and concordances lacked a unified system of text collection and shared views on representative size and on the sources of corpora. The basic principles of concordance compilation, the KWIC (keyword in context) system, and lemmatization were developed in this period. The first generation corpora were mostly compiled for the study of certain genres and/or the speech of certain groups of people. These corpora typically contained texts of limited length, usually no more than 2,000 tokens. Among the most significant achievements of that period are the Brown Corpus and the Lancaster-Oslo/Bergen (LOB) Corpus, the first reference corpora, which were used for lexical and grammatical studies of “language in use”; the first concordance software (CLOC, COCOA); and the first automatic tagging software (TAGGIT). By the early 1990s, the following terms had been introduced, specified, and defined: “corpus linguistics”, “metatext”, “tagging”, “concordancer”, “POS-tagging”, “tokenization”, “segmentation”, “parsing”. The problems of a standardized corpus, its compilation, and its tagging were addressed by the Text Encoding Initiative project (1987). The annotation patterns of that period began to require POS, syntactic, semantic, and other tagging.
Concordancers of the mid-2000s became faster and more user-friendly. Representativeness in corpora was achieved by including spoken and written texts from various communicative events. Accordingly, the reference corpora of the second generation (BNC, ANC) represent the national language across a wide range of written and spoken genres in many territorial dialects. The size of the third generation corpora, or gigacorpora (COCA, Google Books), increased to several billion tokens, and the corpora became dynamic. Their built-in software enables tracking the form, meaning, and use of words and n-grams in written and spoken texts in a number of languages covering several historical periods. Modern concordancers are also tools for compiling small subcorpora and contrasting the results obtained with those from larger corpora (BNC, COCA).
dc.relation.ispartofseries Vestnik Tomskogo Gosudarstvennogo Universiteta, Filologiya
dc.subject Corpus classification
dc.subject Corpus generations
dc.subject Corpus linguistics
dc.subject History of linguistics
dc.subject Text corpora
dc.title The history of corpus linguistics (on the example of the English language corpora)
dc.type Article
dc.relation.ispartofseries-volume 63
dc.collection KFU Staff Publications
dc.relation.startpage 132
dc.source.id SCOPUS19986645-2020-63-SID85087067769


This item appears in the following Collection(s)

  • KFU Staff Publications Scopus [24551]
    The collection contains publications by staff of Kazan Federal University (Kazan State University until 2010) indexed in the Scopus database, starting from 1970.
