The history of corpus linguistics (on the example of the English language corpora)

Gatiyatullina G.M.; Solnyshkina M.I.

dc.contributor.author	Solnyshkina M.I.
dc.contributor.author	Gatiyatullina G.M.
dc.date.accessioned	2021-02-25T20:52:40Z
dc.date.available	2021-02-25T20:52:40Z
dc.date.issued	2020
dc.identifier.issn	1998-6645
dc.identifier.uri	https://dspace.kpfu.ru/xmlui/handle/net/162555
dc.description.abstract	© 2020 Tomsk State University. All rights reserved. The aim of the research is to review the milestones in the development of corpus linguistics and to present an original classification of the main periods in the formation and development of English-language corpora. The classification comprises four periods: (a) the “pre-electronic” period, or the period of text archives, which lasted for several centuries and ended in the 1960s; (b) the “first generation” period, from the 1960s to the mid-1990s; (c) the “second generation” period of megacorpora, which corresponds to the last decade of the 20th century; (d) the “third generation” period of gigacorpora, which started in the mid-2000s. The pre-electronic corpora and concordances lacked a unified system of text collection and agreed views on representative size and on corpus sources. During this period, the basic principles of concordance compilation, the KWIC (keyword in context) system, and lemmatization were developed. The first generation corpora were mostly compiled to study particular genres and/or the speech of particular groups of people. These corpora typically contained texts with a limited number of tokens, usually no more than 2,000. Among the most significant achievements of that period are the Brown Corpus and the Lancaster-Oslo/Bergen (LOB) Corpus, the first reference corpora, which were used for lexical and grammatical studies of “language in use”; the first concordance software (CLOC, COCOA); and the first automatic tagging software (TAGGIT). By the early 1990s, the following terms had been introduced, specified, and defined: “corpus linguistics”, “metatext”, “tagging”, “concordancer”, “POS-tagging”, “tokenization”, “segmentation”, “parsing”. The problems of corpus standardization, compilation, and tagging were addressed in the Text Encoding Initiative project (1987). Annotation schemes of that period began to require POS, syntactic, semantic, and other tagging. 
Concordancers of the mid-2000s became faster and more user-friendly. Representativeness in corpora was achieved by including both spoken and written texts from various communicative events. Accordingly, the reference corpora of the second generation (BNC, ANC) represent the national language through a wide range of written and spoken genres in many territorial dialects. The third generation corpora, or gigacorpora (COCA, Google Books), grew to several billion tokens and became dynamic. Their built-in software makes it possible to track the form, meaning, and use of words and n-grams in written and spoken texts in a number of languages across several historical periods. Modern concordancers also serve as tools for compiling small subcorpora and comparing the results obtained from them with those from larger corpora (BNC, COCA).
dc.relation.ispartofseries	Vestnik Tomskogo Gosudarstvennogo Universiteta, Filologiya
dc.subject	Corpus classification
dc.subject	Corpus generations
dc.subject	Corpus linguistics
dc.subject	History of linguistics
dc.subject	Text corpora
dc.title	The history of corpus linguistics (on the example of the English language corpora)
dc.type	Article
dc.relation.ispartofseries-volume	63
dc.collection	KFU Staff Publications
dc.relation.startpage	132
dc.source.id	SCOPUS19986645-2020-63-SID85087067769

Files in this item

Name: SCOPUS19986645-20 ...

Size: 56.54Kb

Format: PDF


This item appears in the following Collection(s)

KFU Staff Publications Scopus [24551]
The collection contains publications by staff of Kazan Federal University (Kazan State University until 2010) indexed in the Scopus database, from 1970 onward.
