Concept
A corpus (Lehmann 2007[D]) is a collection of texts of a certain category that is accessible as a self-contained whole. The following properties are relevant to the notion and may define different kinds of corpora:
- A corpus is a kind of collection of linguistic data and differs from other kinds by the fact that it consists of texts. That is, collections of syllables or the like are not generally called corpora; they are better described as databases.
- A corpus may be preexistent to scientific inquiry, as e.g. the corpus of the Hittite texts, or it may be compiled by a researcher.
- Traditionally, the term ‘corpus’ implies exhaustive coverage of all the texts of a given category; e.g. the Corpus Scriptorum Ecclesiasticorum Latinorum. Today, a sample of texts is also called a corpus.
- The category of the texts may be inherent to them, as in the corpus of the Platonic texts, or it may be the result of the aim pursued by the researcher, e.g. a corpus sampled from all the books contained in some library.
- A corpus may exist in various physical forms: written/printed texts, audio recordings, electronic files. However, audio recordings are normally transcribed before use, and non-electronic corpora have gradually been converted into an electronic format.
Role
For the researcher, the corpus is a means toward some end. The approaches of the philologist and of the linguist to a corpus typically differ in this respect:
- The philologist is essentially confronted with a preexistent corpus that he wants or has to account for; and this defines the scope of his task.
- The linguist may use a preexistent corpus or establish one according to his research interest (see the discussion of data collection). This research interest typically transcends the corpus; the corpus is only a specimen of the language the linguist is interested in.
At any rate, a corpus is a set of processed primary data. Producing such a set of data is a scientific task in itself. Other scientists may then rely on and use such data.
Purpose
Corpora may be set up for many different purposes. If the purpose is to document a language, one will try to produce what may be called a general-purpose corpus. This is a corpus that aims to represent the bandwidth on all the dimensions of variation in a language. This may be achieved by taking a theory of the speech situation as the point of departure, systematically varying all the parameters involved, and crystallizing text genres as specific combinations of values on these parameters. This is attempted in the table ‘Parameters of speech situation and text genres’ (Lehmann 2001, §5.2).
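The idea of varying the parameters systematically can be illustrated with a minimal sketch. It assumes a small, invented set of parameters and values (`medium`, `participants`, `planning`); these are placeholders for illustration and are not the ones used in the table of Lehmann 2001. The function names are likewise hypothetical. The sketch enumerates every combination of values and reports which combinations are not yet represented by any text in a corpus.

```python
from itertools import product

# Hypothetical parameters of the speech situation. Names and values are
# illustrative assumptions, not those of the table in Lehmann 2001.
PARAMETERS = {
    "medium": ["spoken", "written"],
    "participants": ["monologue", "dialogue"],
    "planning": ["spontaneous", "planned"],
}


def enumerate_genre_cells(parameters):
    """Return every combination of parameter values as a dict."""
    names = list(parameters)
    return [dict(zip(names, values)) for values in product(*parameters.values())]


def uncovered_cells(parameters, corpus_texts):
    """List the combinations for which the corpus contains no text yet.

    Each item of corpus_texts is a dict giving one value per parameter.
    """
    names = list(parameters)
    covered = {tuple(text[name] for name in names) for text in corpus_texts}
    return [
        cell
        for cell in enumerate_genre_cells(parameters)
        if tuple(cell[name] for name in names) not in covered
    ]


if __name__ == "__main__":
    # One text already collected: a spontaneous spoken dialogue.
    collected = [
        {"medium": "spoken", "participants": "dialogue", "planning": "spontaneous"},
    ]
    for cell in uncovered_cells(PARAMETERS, collected):
        print(cell)
```

In practice not every combination corresponds to a genre that actually exists in a speech community, so a list like this would serve as a checklist for the compiler of the corpus rather than as a requirement.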
In the collection of data for the purpose of describing a language, a text corpus enjoys a privileged role. Natural texts in a corpus present linguistic data in their context. A corpus is, thus, closer to representing the ultimate substrate of linguistics than other kinds of linguistic data.
Given that there are two principal ways of obtaining linguistic data, viz. from a corpus of texts or by elicitation of data from native speakers, there was at times a controversy among descriptive linguists concerning the priority of one of these. Since the 1980s, there has been a notion that only natural texts are a reliable data basis, while elicitation is unreliable. As the outcome of this debate, it may be stated that neither of the two methods of data collection is sufficient alone; they must be combined.
Comparative corpora
For certain purposes of general-comparative linguistics including typology, it is useful to have translation equivalents of a given text in different languages. The Bible is an often-used example; another is the set of many translations of Le petit prince by Saint-Exupéry. Besides their undeniable advantages, such sets of texts have the disadvantage of being translations of an original. A translation is less spontaneous than its original, and its style may not be representative of natural texts of the language.
A method of overcoming this disadvantage is to have speakers of different languages tell a narrative about the same subject matter, e.g. a silent movie that they watched. The pear stories (Chafe [ed.] 1980) are a case in point. The degree of comparability of the texts so produced depends on many factors; but in general the method has proved fruitful for the comparison of strategies put to work in the solution of specific tasks of cognition and communication.
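For sets of translation equivalents such as those mentioned above, comparison presupposes that corresponding passages can be lined up across languages. The following is a minimal sketch of one way to do this, assuming that each version has already been divided into segments carrying shared identifiers (for instance verse or chapter numbers). The function name, the language labels and the segment identifiers are hypothetical, and the sample texts are placeholders rather than real data.

```python
def align_by_segment(translations):
    """Line up translation equivalents by shared segment identifier.

    translations maps a language label to a dict of segment_id -> text.
    Returns a dict of segment_id -> {language: text}, restricted to the
    segments that are present in every language.
    """
    if not translations:
        return {}
    shared = set.intersection(*(set(segments) for segments in translations.values()))
    return {
        segment_id: {
            language: segments[segment_id]
            for language, segments in translations.items()
        }
        for segment_id in sorted(shared)
    }


if __name__ == "__main__":
    # Placeholder data: two versions of the same text, one lacking segment s2.
    versions = {
        "lang_a": {"s1": "first segment in A", "s2": "second segment in A"},
        "lang_b": {"s1": "first segment in B"},
    }
    for segment_id, row in align_by_segment(versions).items():
        print(segment_id, row)
```

Restricting the result to segments present in all versions is a design choice of this sketch; narratives elicited from a common stimulus generally lack such shared segmentation and would need a different basis for alignment.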