Corpus Glossaries


Please let us know if you would like to add new terms (click here).

Corpus Glossaries in English

A
  • Attributive adjective
    • “An attributive adjective is an adjective which is used before a noun, such as a rich man.” (CELS, Univ. of Bham)

C
  • Chunk
    • “A sequence of words in text that constitutes a non-recursive, elementary grouping of a particular syntactic category (e.g. nominal, prepositional).” (The Oxford Handbook of Computational Linguistics)
  • Clause
    • “A clause may be finite or non-finite. A finite clause normally consists of at least a subject and a verb, as in the little girl laughed. A non-finite clause begins with a to-infinitive, an ‘-ing’ form or some other non-finite part of a verb e.g. She suggested that we should leave.” (CELS, Univ. of Bham)
  • Cluster
    • “1. A term used to describe any group of words in sequence (for example as used in WordSmith Tools). Also referred to as lexical bundles (Biber et al. 1999, pp. 993-994). 2. A set of texts which statistically share similar linguistic features.” (Baker, Hardie, and McEnery, 2006, p. 34)
  • Cluster analysis
    • “Clustering is the grouping of similar objects (Willett, 1988) and a cluster analysis is a multivariate statistical technique that allows the production of categories by purely automatic means (Oakes, 1998, p. 95). Clustering can therefore be used in order to calculate degrees of similarity or difference between multiple texts, based upon criteria set by the researcher. While clustering techniques have a useful application in document retrieval (van Rijsbergen, 1979), Oakes (ibid., p. 110) also notes that in corpus linguistics various identifiable features such as case, voice or choice of preposition within a text may be clustered in order to demonstrate how such features are used across different genres or by different authors.” (Baker, Hardie, and McEnery, 2006, p. 34)
  • Colligation
    • “Colligation is a term used to describe the tendency of a word to collocate with certain grammatical words or categories, or for certain grammatical features to co-occur. For example, the association of decide with a to-infinitive may be described as colligation, and so may the tendency of the perfect tense to co-occur with phrases beginning with for or since.” (CELS, Univ. of Bham)
  • Collocate
    • “One word collocates with another if the two occur together with a frequency that has a statistical significance. The two words can then be considered to be collocates.” (CELS, Univ. of Bham)
  • Collocation
    • “The tendency of words to co-occur in a patterned way is known as collocation.” (CELS, Univ. of Bham)
    • “The phenomenon whereby particular lexical items occur predominantly, or with a high probability, with particular, identifiable other lexical items.” (The Oxford Handbook of Computational Linguistics)
    • “Described by Firth (1957, p. 14) as ‘actual words in habitual company’, collocation is the phenomenon surrounding the fact that certain words are more likely to occur in combination with other words in certain contexts. A collocate is therefore a word which occurs within the neighbourhood of another word. For example, within WordSmith users can specify a window within which collocational frequencies can be calculated. Table 3 shows the top ten collocates for the word time in the Brown Corpus, within a -5 to +5 span (the most common collocational position is emboldened for each word):… Such tables tend to elicit high-frequency function words, which although useful, does not always show an exclusive relationship between two words. For example, the occurs as a collocate next to many other words, as well as time. We would perhaps find lower-frequency lexical words such as waste, devote, spend, spare and limit to be more illustrative collocates of time. Corpus linguistics techniques have therefore allowed researchers to demonstrate the frequency and exclusivity of particular collocates, using statistical methods such as mutual information, the Z-score (Berry-Rogghe, 1973), MI3 (Oakes, 1998, pp. 171-172), log-log (Kilgarriff & Tugwell, 2001) or log-likelihood (Dunning, 1993) scores. Each method returns a value showing strength of collocation, but their criteria for assignment differ. For example, mutual information foregrounds the frequency with which collocates occur together as opposed to their independent occurrence whereas it is more probable that log-likelihood will register strong collocation when the individual words are themselves frequent. So mutual information will give a high collocation score to relatively low-frequency word pairs like bits/bobs, whereas log-likelihood will give a higher score to higher-frequency pairs such as school/teacher.
Collocations can be useful in terms of language teaching – making students aware of low-frequency collocates that native speakers have internalised (e.g. Hoffman & Lehmann, 2000). In addition, collocates can be useful for demonstrating the existence of bias or connotation in words. For example, the strongest collocate in the British National Corpus (BNC) of the word bystander is innocent, suggesting that even in cases where bystander occurs without this collocate, the concept of innocence could still be implied.” (Baker, Hardie, and McEnery, 2006, pp. 36-38)
  • Computer-Assisted Language Learning
    • “Any use of computers to provide language instruction or to support language learning.” (The Oxford Handbook of Computational Linguistics)
  • Concordance
    • “A list of all the occurrences of a particular search term in a corpus, presented within the context in which they occur” (Baker et al., 2006, p. 42)
    • “A list showing all the occurrences and contexts of a given word or phrase, as found in a corpus, typically in the form of a KWIC index.” (The Oxford Handbook of Computational Linguistics)
  • Corpus
    • “a corpus is a collection of naturally-occurring language text, chosen to characterize a state or variety of a language” (Sinclair, 1991, p. 171)
    • “a corpus is a collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language” (Expert Advisory Group on Language Engineering Standards (EAGLES), 1994, Section 2.1)
    • “corpus: (i) (loosely) any body of text, (ii) (most commonly) a body of machine-readable text, (iii) (more strictly) a finite collection of machine-readable text, sampled to be maximally representative of a language or variety” (McEnery & Wilson, 1996, p. 177)
    • “a corpus is a collection of electronic texts, written or spoken, which is stored on a computer” (O’Keeffe, McCarthy & Carter, 2007, p. 1)
    • “a corpus is a collection of texts… in any language(s)… spoken (transcribed) or written… usually naturally-occurring (not written specifically)… stored and searchable electronically” (Hunston, 2008, CLARET Workshop, Birmingham, UK)
    • “A corpus is a collection of naturally occurring spoken or written texts which is searchable electronically” (Jung, 2011, p. 32)
    • “A body of linguistic data, usually naturally occurring data in machine readable form, especially one that has been gathered according to some principled sampling method.” (The Oxford Handbook of Computational Linguistics)
  • Corpus linguistics
    • “The study of language based on examples of real-life language use.” (McEnery & Wilson, 1996, p. 1)
    • “A computer-assisted methodology that addresses a range of questions in linguistics by empirical analysis of bodies of naturally occurring speech and writing.” (The Oxford Handbook of Computational Linguistics)
  • Coverage
    • “In lexicography, the extent to which a dictionary or lexicon has entries for all the words in the target language. Since the vocabulary of a natural language is a non-finite set, 100 per cent coverage is an impossible goal, hence compromises are necessary to achieve acceptable coverage.” (The Oxford Handbook of Computational Linguistics)
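The Concordance entry above describes listing every occurrence of a search term together with its context, typically as a KWIC index. The sketch below is one minimal way to produce such a listing in Python. The function name `kwic`, the `width` parameter and the sample sentence are illustrative assumptions and are not taken from any of the cited sources or from a particular concordancer.

```python
import re

def kwic(text, node, width=30):
    """Return one KWIC line per occurrence of `node` in `text`."""
    lines = []
    # Whole-word, case-insensitive match of the search term.
    pattern = r"\b" + re.escape(node) + r"\b"
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the node words line up in a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines

# Invented sample text, for illustration only.
sample = "Time flies. We spend time wisely, and we waste no time at all."
for line in kwic(sample, "time", width=20):
    print(line)
```

Context here is measured in characters for simplicity; a concordancer that measures context in words would tokenize the text first.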

D
  • Derivation
    • “(i) In morphology, the production of new words, usually of a different part-of-speech category, by adding a bound morph to a base form; (ii) in formal languages, the transformation of a string into another string by means of the application of the rules in a grammar.” (The Oxford Handbook of Computational Linguistics)
  • Determiner
    • “A determiner is a word such as the, a, my, each which occurs at the beginning of a noun group.” (CELS, Univ. of Bham)
  • Dialogue corpus
    • “A corpus consisting of a collection of dialogues.” (The Oxford Handbook of Computational Linguistics)
  • Distribution
    • “The variety of different texts in a language or a corpus in which a particular word or phrase is used. Some terms tend to cluster in particular domains or text types.” (The Oxford Handbook of Computational Linguistics)
  • Domain
    • “A distinct or specified area of language, dialogue, or discourse. Some words, phrases, and structures tend to be associated with particular domains (e.g. ‘renal’ is associated with the medical domain), while others have meanings that are domain-specific (e.g. ‘treat’, ‘cure’, and ‘patient’ have meanings that are particularly associated with the medical domain). Domain boundaries tend to be fuzzy, and the number of domains is non-finite. Domain is one of the variables that define types of dialogue, e.g. travel, transport, appointment scheduling etc.” (The Oxford Handbook of Computational Linguistics)

E
  • Ergative verb
    • “An ergative verb is a verb such as open which can be used transitively, with an object (she opened the door) or intransitively, without an object (the door opened). The object of the transitive use can be the subject of the intransitive use.” (CELS, Univ. of Bham)

G
  • Grammatical word
    • “A word which belongs to a closed class of items, such as pronouns, and which carries little meaning, is often called a grammatical word. Grammatical words include determiners, pronouns, prepositions and conjunctions.” (CELS, Univ. of Bham)

H
  • Hapax
    • “In corpus linguistics, a hapax is a word which occurs only once in a corpus.” (CELS, Univ. of Bham)
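As a small illustration of the Hapax entry, the sketch below picks out the word forms that occur exactly once in a token list. The function name `hapaxes` and the sample sentence are assumptions made for this example; a real study would first settle how the corpus is tokenized and whether case is folded.

```python
from collections import Counter

def hapaxes(tokens):
    """Return the word forms that occur exactly once, in order of first appearance."""
    counts = Counter(tokens)
    return [w for w in counts if counts[w] == 1]

# Invented sample, for illustration only.
tokens = "the cat sat on the mat while the dog slept".split()
print(hapaxes(tokens))
```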

I
  • Idiom principle
    • “In Sinclair’s theory of language, the idiom principle is one way of interpreting language, that takes the meaning from a whole phrase, rather than from the individual components of it.” (CELS, Univ. of Bham)
  • ‘-ing’ clause
    • “An ‘-ing’ clause is a clause beginning with the ‘-ing’ form of a verb e.g. she liked riding her bicycle quickly over the fields.” (CELS, Univ. of Bham)
  • Intercalated text
    • “In translation studies, an intercalated text is one where the original text and the translated text are presented line by line, so that the translation of each line of the original is shown directly below the original.” (CELS, Univ. of Bham)

L
  • Lemma
    • “All the forms of a noun, verb, adjective or adverb together are known as the lemma. For example, the lemma GO includes the word-forms go, goes, going and went.” (CELS, Univ. of Bham)
    • “A set of lexical forms having the same stem and belonging to the same major word class, differing only in inflection or spelling” (Francis & Kucera, 1982, p. 1)
    • “The canonical form of a word, usually the base form, taken as being representative of all the various forms of a morphological paradigm.” (The Oxford Handbook of Computational Linguistics)
  • Lemmatization
    • “The process of grouping the inflected forms of a word together under a base form, or of recovering the base form from an inflected form, e.g. grouping the inflected forms ‘run’, ‘runs’, ‘running’, ‘ran’ under the base form ‘run’.” (The Oxford Handbook of Computational Linguistics)
  • Lexical word
    • “A word which is not a grammatical word is a lexical word. Lexical words carry meaning and do not belong to closed sets. Nouns, verbs, adjectives and adverbs are all lexical words.” (CELS, Univ. of Bham)
  • Link verb
    • “A link verb is a verb such as be, become or seem.” (CELS, Univ. of Bham)
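The Lemma and Lemmatization entries above describe grouping word forms such as go, goes, going, went under one base form. The sketch below shows the idea with a tiny lookup table. The table `LEMMA_OF` and both function names are hypothetical and written by hand for this example; practical lemmatizers rely on large lexicons or morphological rules rather than a short dictionary like this one.

```python
from collections import defaultdict

# Hypothetical hand-written lookup table using the forms quoted in the
# Lemma and Lemmatization entries.
LEMMA_OF = {
    "go": "go", "goes": "go", "going": "go", "went": "go",
    "run": "run", "runs": "run", "running": "run", "ran": "run",
}

def lemmatize(token):
    """Map a word form to its lemma; unknown forms are returned lower-cased."""
    return LEMMA_OF.get(token.lower(), token.lower())

def group_by_lemma(tokens):
    """Group the word forms found in `tokens` under their lemma."""
    groups = defaultdict(set)
    for tok in tokens:
        groups[lemmatize(tok)].add(tok.lower())
    return dict(groups)

# Invented sample, for illustration only.
forms = "She went running and he goes where they ran".split()
print(group_by_lemma(forms))
```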

M
  • Monitor corpus
    • “A corpus that enables diachronic research by continually growing over time. While a monitor corpus can operate with respect to an overall sampling frame, monitor corpora tend to de-emphasize the role of sampling frame specification in favour of a large volume of material which is continually supplemented over time.” (The Oxford Handbook of Computational Linguistics)
  • Monolingual corpus
    • “A corpus in which all of the texts belong to the same language.” (The Oxford Handbook of Computational Linguistics)
  • Morph
    • “The actual realization of an abstract morpheme as part of a word.” (The Oxford Handbook of Computational Linguistics)
  • Morpheme
    • “Any of the basic building blocks of morphology, defined as the smallest units in language to which a meaning may be assigned or, alternatively, as the minimal units of grammatical analysis. Morphemes are abstract entities expressing basic semantic or syntactic features.” (The Oxford Handbook of Computational Linguistics)

N
  • Node word
    • “In corpus linguistics, the node word is the central word in a set of collocations; the word which the computer has taken as the central word for the search.” (CELS, Univ. of Bham)
  • N-gram
    • “A sequence of n tokens.” (The Oxford Handbook of Computational Linguistics)
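Since an n-gram is defined above simply as a sequence of n tokens, extracting all n-grams from a text amounts to sliding a window of length n over the token list. The following is a minimal sketch; the function name `ngrams` and the sample tokens are assumptions made for this example.

```python
def ngrams(tokens, n):
    """Return all contiguous sequences of n tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Invented sample, for illustration only.
tokens = "the little girl laughed".split()
print(ngrams(tokens, 2))
```

When n is larger than the number of tokens, the function returns an empty list.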

P
  • Parallel corpora
    • “Two or more corpora in which one corpus contains data produced by native speakers of a language while the other corpus/corpora have that original translated into another/a range of other languages.” (The Oxford Handbook of Computational Linguistics)
  • Part-of-speech
    • “Any of the basic grammatical classes of words, such as noun, verb, adjective, and preposition.” (The Oxford Handbook of Computational Linguistics)
  • Part-of-speech tag
    • “A label specifying a part of speech.” (The Oxford Handbook of Computational Linguistics)
  • Part-of-speech tagger
    • “A computer program for assigning labels for grammatical classes of words.” (The Oxford Handbook of Computational Linguistics)
  • Part-of-speech tagging
    • “Assigning labels for grammatical classes of words through a computer program.” (The Oxford Handbook of Computational Linguistics)
  • Phrase
    • “A sequence of words that can be processed as a single unit in a text.” (The Oxford Handbook of Computational Linguistics)

R
  • Regular expression
    • “An expression that describes a set of strings (= a regular language) or a set of ordered pairs of strings (= a regular relation). Every language or relation described by a regular expression can be represented by a finite-state automaton. There are many regular expression formalisms. The most common operators are concatenation, union, intersection, complement (=negation), iteration and composition.” (The Oxford Handbook of Computational Linguistics)
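To make some of the operators named above concrete, the sketch below uses Python's `re` module, which is only one of the many regular expression formalisms. It searches for forms of decide followed by a to-infinitive, the colligation example given earlier in this glossary. The pattern and the sample sentence are assumptions made for this example. Note that `re` provides concatenation, union (`|`) and iteration (`+`, `*`) directly but has no dedicated intersection or complement operator.

```python
import re

# Concatenation: "decid" followed by a suffix; union: (e|es|ed|ing);
# iteration: \s+ means one or more whitespace characters.
pattern = re.compile(r"\bdecid(e|es|ed|ing)\s+to\s+\w+")

# Invented sample, for illustration only.
text = "She decided to leave, but he decides  to stay and I decide against it."
matches = [m.group(0) for m in pattern.finditer(text)]
print(matches)
```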

S
  • Second language acquisition
    • “The learning of a second or subsequent language. The term acquisition is preferred by people who wish to emphasize the role of unconscious, automatic processes in learning.” (The Oxford Handbook of Computational Linguistics)
  • Second language learning
    • “The learning of a second or subsequent language in a country where that language is spoken, i.e. in a situation where there is typically a great deal of natural interaction in the new language. Compare foreign language learning.” (The Oxford Handbook of Computational Linguistics)
  • Segment
    • “(i) (Verb) the act of splitting up a dialogue into utterance units; (ii) (noun) any of the subunits into which a text may be divided; (iii) (noun) a unit of sound in phonetics; (iv) (noun) an alternative term for utterance unit, best avoided as it can easily be confused with (ii) or (iii).” (The Oxford Handbook of Computational Linguistics)
  • Semantic prosody
    • “If a word is often used with other words with a positive or negative meaning, such that the word gets a positive or negative connotation, this is known as semantic prosody.” (CELS, Univ. of Bham)
  • Span
    • “In corpus linguistics, the node word and a given number of words to the left and right are known as the span. This is used to calculate collocations.” (CELS, Univ. of Bham)
  • Spoken corpus
    • “A corpus that seeks to represent naturally occurring spoken language. While this could in principle be simply a collection of tape recordings, it is much more common to find that such material has been orthographically transcribed. It may also be that the material has been phonemically transcribed either in addition to, or instead of, an orthographic transcription, sometimes with suprasegmental markings.” (The Oxford Handbook of Computational Linguistics)
  • Spontaneous speech
    • “Speech that is formulated freely without the use of written cues or any careful preparation (e.g. ordinary conversation, face to face or on the phone).” (The Oxford Handbook of Computational Linguistics)

T
  • t-score
    • “A t-score is a statistical measure which compares actual frequency with expected frequency. It measures the certainty of a collocation.” (CELS, Univ. of Bham)
  • Tag
    • “A grammatical label, typically one attached to a word in context, expressing its part of speech and inflection, or in some cases semantic or other information.” (The Oxford Handbook of Computational Linguistics)
  • Tag set
    • “(i) The set of labels (which are usually part of a mark-up language) used to tag a text for computational processing. Compare annotation; (ii) the set of XML or other labels used to structure a dictionary entry systematically.” (The Oxford Handbook of Computational Linguistics)
  • Tagging
    • “Assignment of tags to words or expressions in a text.” (The Oxford Handbook of Computational Linguistics)
  • That-clause
    • “A that-clause is a finite clause that may begin with that but does not always do so e.g. I thought that he was tired; he said he was tired.” (CELS, Univ. of Bham)
  • To-infinitive clause
    • “A to-infinitive clause is a non-finite clause that begins with a to-infinitive e.g. she told them to go.” (CELS, Univ. of Bham)
  • Token
    • “In corpus linguistics, the number of tokens in a text is the total number of running words.” (CELS, Univ. of Bham)
  • Tokenization
    • “The process of segmenting text into linguistic units such as words, punctuation, numbers, alphanumerics etc.” (The Oxford Handbook of Computational Linguistics)
  • Tokenizer
    • “A software program that performs text tokenization and determines boundaries for individual tokens (words, numbers, punctuation) in text.” (The Oxford Handbook of Computational Linguistics)
  • Type
    • “In corpus linguistics, the number of types in a text is the number of different words in it.” (CELS, Univ. of Bham)
  • Type-token ratio
    • “In corpus linguistics, the type-token ratio is obtained by dividing the number of types by the number of tokens.” (CELS, Univ. of Bham)
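The Token, Tokenization, Type and Type-token ratio entries above can be tied together in a few lines. The sketch below assumes a deliberately simple tokenizer that keeps only lower-cased alphabetic words and drops punctuation and numbers; the tokenizers described above may also treat those as units. The function names and the sample text are assumptions made for this example.

```python
import re

def tokenize(text):
    """Split text into lower-cased word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def type_token_ratio(tokens):
    """Number of distinct word forms divided by the number of running words."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Invented sample, for illustration only.
tokens = tokenize("The door opened. She opened the door.")
print(len(tokens), len(set(tokens)), type_token_ratio(tokens))
```

Because the ratio tends to fall as a text gets longer, comparisons are usually only meaningful between samples of similar size.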

U
  • Unit of meaning
    • “In Sinclair’s theory of language, a unit of meaning is a phrase that carries a meaning and into which a text may be divided.” (CELS, Univ. of Bham)
  • Utterance
    • “A unit of spoken text, typically loosely defined and used. On the structural level utterances may correspond to phrases or sentences uttered by a speaker, whereas on the functional level they may correspond to dialogue acts.” (The Oxford Handbook of Computational Linguistics)

V
  • Valence
    • “The number and kinds of words and phrases a word can combine with in regular patterns. The valence of a word is often called its subcategorization.” (The Oxford Handbook of Computational Linguistics)

W
  • Word class
    • “All the words that behave in a particular way are known as a word class. For example, dog, attitude and brother in some ways behave similarly. They all belong to the word class ‘noun’.” (CELS, Univ. of Bham)
  • Word list
    • “A list of all of the words that appear in a text or corpus, often useful for dictionary creation. Word lists often give the frequencies of each word (or token) in the corpus. Words are most usually ordered alphabetically, or in terms of frequency, either with a raw frequency count and/or the percentage that the word contributes towards the whole text. Additionally, word lists can be lemmatised or annotated with part-of-speech or semantic information (including probabilities – for example, the word house occurs as a noun about 99 per cent of the time and as a verb 1 per cent of the time). Word lists are needed when calculating key words and key key words.” (Baker et al., 2006, p. 169)
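A frequency-ordered word list of the kind described above, with a raw count and a percentage for each word, can be sketched as follows. The function name `word_list`, the tie-breaking rule and the sample tokens are assumptions made for this example.

```python
from collections import Counter

def word_list(tokens):
    """Return (word, raw frequency, percentage of all tokens), most frequent first."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # Sort by descending frequency, then alphabetically to break ties.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(w, f, 100 * f / total) for w, f in ranked]

# Invented sample, for illustration only.
tokens = "the cat saw the dog and the dog saw the cat".split()
for word, freq, pct in word_list(tokens):
    print(f"{word:<5} {freq:>3} {pct:5.1f}%")
```

Sorting on the word alone instead would give the alphabetical ordering that the definition also mentions.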

XYZ

Corpus Glossaries in Korean

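  • Synchronic corpus (공시 코퍼스)
    • “A synchronic corpus is a systematically compiled collection of language data from a period short enough to be regarded as contemporaneous, so that the state of the language at a given point in time can be examined.” (권혁승, 정채관, 2012, p. 11)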


  • Multilingual corpus (다국어 코퍼스)
    • “A corpus made up of texts in three or more languages is called a multilingual corpus.” (권혁승, 정채관, 2012, p. 12)



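  • General corpus (범용/일반 코퍼스)
    • “A corpus built by combining material from a variety of registers of a language in appropriate proportions, so that it can be representative of that language, is called a general corpus…” (권혁승, 정채관, 2012, pp. 10-11)
  • Parallel corpus (병렬 코퍼스)
    • “A corpus in which the same content is presented as texts in two languages is called a parallel corpus…” (권혁승, 정채관, 2012, p. 12)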


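  • Collocation (연어 관계)
    • “Collocation is defined as ‘the co-occurrence of two or more words within a short space of each other in a text’, with the emphasis on the associative relationship between words that are used together.” (권혁승, 정채관, 2012, p. 31)
  • Criteria for collocation (연어의 기준)
    • “Just as definitions of collocation vary, so do the criteria for identifying one. There are four general criteria for identifying collocations (Benson, 1989; Manning & Schütze, 1999). The first is non-compositionality: the meaning of a collocation cannot be obtained directly by combining the meanings of its component words. The meaning of the collocation may be completely different from the meanings of its words (e.g. kick the bucket), or the overall meaning may not be predictable even though an element of meaning is added to the central word of the collocation (the base) (e.g. white wine, white hair, white women, strong tea). The second is non-substitutability: if one word of a collocation is replaced by another word with a similar meaning, the result may be an expression that is never used or that is contextually awkward. For example, replacing ‘white’ with ‘yellow’ in ‘white wine’ yields ‘yellow wine’, but ‘yellow wine’ is hardly ever used. Even if ‘yellow’ describes the actual colour of white wine, one does not say ‘yellow wine’. The third is non-modifiability: many collocations cannot be modified by additional words and do not even allow grammatical changes (e.g. from singular to plural). This is all the more true of fixed expressions such as idioms. For example, ‘to get a frog in one’s throat’ cannot take an added ‘ugly’, as in ‘to get an ugly frog in one’s throat’, nor can ‘frog’ be changed to ‘frogs’. The fourth is non-translatability: when the component words of a collocation are translated individually into another language, the meaning is not conveyed properly. This can be a very good criterion for identifying collocations. For example, translating each word of ‘make a decision’ into French could give ‘faire une décision’, but this expression is not used; ‘prendre une décision’ is the correct translation. ‘Make a decision’ can therefore be said to be a collocation.” (권혁승, 정채관, 김재훈, 2018, pp. 110-111)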
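  • Types of collocation (연어의 유형)
    • “There are also many ways of classifying types of collocation, varying from scholar to scholar, but collocations are broadly divided into lexical collocations and grammatical collocations (Benson et al., 2009; 문금현, 2002; 임근석, 2008). A lexical collocation is a close syntactic combination of content words, in which the content word that does the selecting (the base) prefers the content word that is selected (the collocate) (임근석, 2006). In other words, it is a construction in which content words such as nouns, verbs, adjectives and adverbs form a close co-occurrence relation. Lexical collocations generally involve grammatical relations, so in Korean they include grammatical morphemes such as particles and endings. By grammatical criteria they are classified into subject-predicate, object-predicate and modifier relations, with examples such as those below (문금현, 2002). A grammatical collocation is a close syntactic combination of a content word and a function word, in which the content word that does the selecting (the base) prefers the function word that is selected (the collocate) (임근석, 2006); it involves grammatical agreement, and examples are as follows. (passage omitted) Such combinations of words, which cannot be explained as free grammatical combinations, have traditionally been treated under concepts such as idiomatic phrases, idioms and fixed expressions. In natural language processing they are called multiword expressions, and in education they are also called formulaic expressions. In English, light verbs, participial constructions of verbs and technical terms are the types of collocation from a linguistic point of view.” (권혁승, 정채관, 김재훈, 2018, pp. 111-112)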
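  • Measures of collocation (연어 관계 측정 방식)
    • “As seen above, several statistical measures are used to compute collocation, and the different measures sometimes give very different results. Which measure to use may therefore depend on the type of collocation sought or on the search conditions. To obtain a collocation list, it is thus advisable to try a range of measures in a range of ways from the outset and then adopt the most appropriate one. Likewise, interpreting and using the results of a collocation list search requires a clear understanding of which measure was used and how. Because window collocation is a probabilistic relation that incorporates a spatial concept, search results can vary with the type of collocation search, the window setting, the measure used, and so on. Unlike adjacent collocation, the words need not be next to the node word and in a direct lexico-grammatical relation with it; window collocation can be seen as a probabilistic association between the node word and every word within a certain range of it, and in some cases such an association between words may also be an adjacent collocation.” (권혁승, 정채관, 김재훈, 2018, pp. 118-119)
  • MI-score (엠 아이 스코어)
    • “Measures of the statistical significance of a collocation include the MI-score, t-score and z-score. The most widely used is the MI-score, a value obtained by comparing the frequency with which each of the two words occurs individually in the whole corpus with the frequency with which the two words occur consecutively.” (권혁승, 정채관, 2012, p. 32)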
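  • Raw corpus (원시 코퍼스)
    • “A corpus in which the texts that make up the corpus data are stored in electronic form as they are, without any additional information attached, is called a raw corpus…” (권혁승, 정채관, 2012, p. 10)
  • Native corpus (원어민 코퍼스)
    • “A corpus whose language data come from people who speak that language as their mother tongue is called a native corpus…” (권혁승, 정채관, 2012, p. 11)
  • Web corpus (웹 코퍼스)
    • “With the start of the 21st century, a new concept of corpus emerged that uses all web material on the internet as its database; this is called a web corpus” (권혁승, 정채관, 2012, p. 13)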
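The MI-score entry in this group names three significance measures: MI-score, t-score and z-score. The sketch below computes all three from four frequencies, using commonly cited simplified formulations: MI = log2(O/E), t = (O - E)/sqrt(O) and z = (O - E)/sqrt(E), where O is the observed co-occurrence frequency and E = f(x)·f(y)/N is the frequency expected by chance. These formulas are an assumption for illustration and may differ in detail (for example in how the span is factored in) from those used in the cited books or in a given corpus tool. The function name and all the frequencies are invented.

```python
import math

def association_scores(f_xy, f_x, f_y, n):
    """Compute MI-score, t-score and z-score for a word pair.

    f_xy: times the two words occur together; f_x, f_y: their individual
    frequencies; n: corpus size in tokens. All four are assumed to be > 0.
    """
    expected = f_x * f_y / n  # co-occurrences expected by chance
    mi = math.log2(f_xy / expected)
    t = (f_xy - expected) / math.sqrt(f_xy)
    z = (f_xy - expected) / math.sqrt(expected)
    return mi, t, z

# Made-up frequencies, for illustration only.
rare_pair = association_scores(f_xy=8, f_x=10, f_y=12, n=1_000_000)
frequent_pair = association_scores(f_xy=200, f_x=20_000, f_y=5_000, n=1_000_000)
print(rare_pair)
print(frequent_pair)
```

With these invented numbers the rare pair gets the higher MI-score while the frequent pair gets the higher t-score, which is in line with the remark quoted in the Collocation entry that mutual information favours relatively low-frequency word pairs.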

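  • Annotated corpus (주석 코퍼스)
    • “A corpus in which additional information has been attached to the texts so that patterns can be searched easily by computer is called an annotated corpus.” (권혁승, 정채관, 2012, p. 10)
  • Z-score (제트 스코어)
    • “The Z-score is a formula that computes the Z-score value by giving somewhat more weight than the MI-score to the total frequency of the node word in the corpus.” (권혁승, 정채관, 김재훈, 2018, p. 117)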

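  • Window collocation (창 연어 관계)
    • “Window collocation started from Firth’s semantic concept and developed into a probabilistic relation by incorporating Sinclair’s spatial concept; as large-scale computerized corpora came into general use, a variety of statistical calculation methods were developed. The core of the collocation proposed by Sinclair is to take a fixed space to the left and right of the node word as a window, statistically analyse all the words that appear within this space, and measure their mutual significance in order to find collocates.” (권혁승, 정채관, 김재훈, 2018, p. 113)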
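The window collocation entry describes taking a fixed space to the left and right of the node word as a window and analysing every word inside it. The first step, collecting raw co-occurrence counts within the window, can be sketched as below. The function name `window_collocates`, the `span` parameter and the sample tokens are assumptions made for this example; the statistical scoring that follows this step is not shown here.

```python
from collections import Counter

def window_collocates(tokens, node, span=5):
    """Count every word occurring within `span` tokens left or right of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        start = max(0, i - span)
        # Left context plus right context, leaving out the node occurrence itself.
        # Other occurrences of the node that fall inside the window are counted.
        window = tokens[start:i] + tokens[i + 1:i + 1 + span]
        counts.update(window)
    return counts

# Invented sample, for illustration only.
tokens = "we waste time and they spend time but time is short".split()
print(window_collocates(tokens, "time", span=2).most_common(3))
```

A span of 5 corresponds to the -5 to +5 window mentioned in the Collocation entry of the English glossary; the example uses 2 only to keep the output small.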

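  • Corpus (코퍼스)
    • “A ‘corpus’ is a collection of language data. Here, language covers both speech (spoken language) and writing (written language).” (권혁승, 정채관, 2012, p. 1)
  • Corpus software (코퍼스 소프트웨어)
    • “A computer program used to carry out corpus-related language analysis or to apply the results of such research to teaching and learning” (권혁승, 정채관, 2012, p. 54)
  • Corpus linguistics (코퍼스 언어학)
    • “‘Corpus linguistics’ is the study of language using corpora. Rather than studying a particular area of language or a particular theory, as the traditional fields of language study such as morphology, phonetics, phonology, structure and meaning do, corpus linguistics is a research methodology that uses corpora to analyse and describe language in order to find answers to a variety of linguistic phenomena across all areas of language study.” (권혁승, 정채관, 2012, pp. 2-3)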
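  • Concordance (콘코던스)
    • “The Korean term 콘코던스 is borrowed from the English ‘concordance’ and is also rendered as 어구색인 (phrase index) or 용례색인 (usage index); it refers to presenting the examples extracted from a corpus in KWIC (Key Word in Context) format.” (권혁승, 정채관, 2012, p. 26)
  • Keyword calculation (키워드 추출 방식)
    • Keywords are extracted by comparing the corpus under study with a reference corpus, on the basis of frequencies that are unusually high or low to a statistically significant degree. Because the values produced by a given statistical method differ slightly depending on the characteristics of the corpora, the choice of method needs careful consideration. (Source: https://lexically.net/downloads/version7/HTML/keywords_calculate_info.htm)
  • KWIC: Key Word in Context (퀵)
    • “KWIC is a way of displaying a node word (keyword) aligned within the context in which it is used, and can be regarded as a kind of index.” (권혁승, 정채관, 2012, p. 26)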
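The keyword calculation entry in this group does not name a specific statistic. One commonly used choice is the log-likelihood score (Dunning, 1993, cited in the Collocation entry of the English glossary), and the sketch below uses the simplified two-term form of it as an assumption. The function name and all counts are invented for illustration and are not taken from the linked WordSmith page.

```python
import math

def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Two-term log-likelihood keyness of a word (higher = more distinctive)."""
    total = freq_study + freq_ref
    expected_study = size_study * total / (size_study + size_ref)
    expected_ref = size_ref * total / (size_study + size_ref)
    ll = 0.0
    # A zero frequency contributes nothing (the limit of x*ln(x) as x -> 0).
    if freq_study > 0:
        ll += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll

# Made-up counts: a word seen 120 times in a 100,000-token study corpus
# and 300 times in a 1,000,000-token reference corpus.
overused = log_likelihood(120, 100_000, 300, 1_000_000)
# The same relative frequency in both corpora gives a score of 0.
balanced = log_likelihood(10, 100_000, 100, 1_000_000)
print(overused, balanced)
```

The score itself does not say whether a word is unusually frequent or unusually rare in the study corpus; comparing the two relative frequencies gives the direction.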

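  • Text archive (텍스트 아카이브)
    • “A text archive is a database built by collecting texts; it differs from a corpus in that the set of individual texts does not represent a particular (domain of a) language. Well-known archives include the Oxford Text Archive, which consists of various corpora and old documents, and Project Gutenberg, which consists of literary works that have no copyright or whose copyright has expired.” (권혁승, 정채관, 2012, pp. 12-13)
  • Diachronic corpus (통시 코퍼스)
    • “A corpus in which language data from a relatively long period have been systematically compiled so that language change over a given period can be examined is called a diachronic corpus…” (권혁승, 정채관, 2012, p. 11)
  • Specialized corpus (특수 코퍼스)
    • “A corpus made up only of language from a particular register, built so that the linguistic features of that domain can be studied, is called a specialized corpus.” (권혁승, 정채관, 2012, p. 11)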


  • Learner corpus (학습자 코퍼스)
    • “A corpus whose language data come from people learning that language as a foreign language is called a learner corpus.” (권혁승, 정채관, 2012, p. 11)