Given a corpus of texts C, a concordance for a certain word form W (taken as a type) is the set of all occurrences (tokens) of W in C, arranged in a table such that
Derivatively, a concordance of corpus C is the set of the concordances of all word types Wi in C.
The following is a concordance of the word form ‘of’ in the first paragraph of the section ‘Reference entries’ of the page ‘Lemmatization’.
reference | preceding context | target | following context |
---|---|---|---|
lemm 2.4, 01 | Lemmatization is a decision in favor | of | one form of an expression which |
lemm 2.4, 01 | decision in favor of one form | of | an expression which is considered its |
lemm 2.4, 04 | destined for users with imperfect knowledge | of | the language in question |
The reference to the place of the token identifies
The context of the token in question must be limited in some sensible way. For many applications, it would seem desirable to reproduce the entire containing sentence, but this would be too expensive and, in the case of long sentences, unnecessary. In principle, one could reproduce some suitable construction (phrase) containing the token in question. However, that would require a human analyst, which is undesirable not only for economic reasons, but also because the concordance is supposed to serve as a theory-free analytical tool in the first place: it does not presuppose the analysis of syntactic constructions, but instead helps to carry out such an analysis on an empirical basis. Therefore the context in a concordance is mostly clipped mechanically, e.g. by limiting it to a certain number of text words on either side of the target. The user can always find the full context by following up the reference.
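The mechanical clipping described here can be illustrated with a minimal Python sketch. It is not the procedure used to produce the table above; the function name `concordance`, the naive tokenization rule and the default window of six text words are assumptions made for illustration. The sample string is pieced together from the context fragments shown in the first two table rows, and the running token position stands in for a proper reference.

```python
import re

def concordance(text, target, window=6):
    """Return one row (position, preceding, target, following) per token of `target`.

    Assumption: a text word is any run of letters, digits, apostrophes or hyphens.
    """
    tokens = re.findall(r"[\w'-]+", text)
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() == target.lower():
            # Clip the context mechanically: at most `window` words on either side.
            preceding = " ".join(tokens[max(0, i - window):i])
            following = " ".join(tokens[i + 1:i + 1 + window])
            rows.append((i, preceding, token, following))
    return rows

# Fragment reassembled from the table rows above (not the full source paragraph).
sample = ("Lemmatization is a decision in favor of one form of an "
          "expression which is considered its")

for position, preceding, token, following in concordance(sample, "of"):
    print(f"{position:>3} | {preceding} | {token} | {following} |")
```

Because the clipping only counts words, no syntactic analysis is involved; a different window size simply changes the `window` argument.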
A concordance is based on a word list of a corpus, for which see the section on lemmatization.
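How a concordance of a whole corpus might be derived from its word list can be sketched as follows. This is only an illustration under stated assumptions: the names `build_index` and `corpus_concordance` are hypothetical, word forms are folded to lower case before being grouped into types, the toy sample text is invented, and no lemmatization is attempted.

```python
import re
from collections import defaultdict

def build_index(text):
    """Map each word type to the positions of its tokens in the text."""
    # Assumption: case is folded, so 'Of' and 'of' count as one type.
    tokens = re.findall(r"[\w'-]+", text.lower())
    index = defaultdict(list)
    for position, token in enumerate(tokens):
        index[token].append(position)
    return tokens, dict(index)

def corpus_concordance(text, window=3):
    """Return {word type: [(position, preceding, following), ...]} for all types."""
    tokens, index = build_index(text)
    result = {}
    for word_type in sorted(index):  # the sorted keys form the word list
        result[word_type] = [
            (pos,
             " ".join(tokens[max(0, pos - window):pos]),
             " ".join(tokens[pos + 1:pos + 1 + window]))
            for pos in index[word_type]
        ]
    return result

# Invented toy corpus, for illustration only.
sample = "one form of an expression and one form of a word"
conc = corpus_concordance(sample)
word_list = sorted(conc)
print(word_list)
print(conc["of"])
```

The index of token positions per type is the word list enriched with locations; the concordance of the corpus is then just the collection of per-type tables built from it.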
Concordances may be produced for many different purposes. For the lexicographer, they show the range of contextual variation of each word form in his corpus. He needs that for the following analytic steps: