The microstructure of a dictionary – more precisely, of a word list – is the internal structure of any of its entries.
The outline of the microstructure of a dictionary presented below obeys the following principles:
The complete set of fields, with their numbers, names, explanation, examples and linking relations to other parts of the overall linguistic description, is listed in a table on a separate page.
Each entry of a dictionary must be identifiable, both by references from within the dictionary and by references from outside (e.g. the accompanying grammar). If the lexical database is a relational database, then each entry will have an ID in the technical sense. However, since that is a user-unfriendly number, it is normally not suitable for human users. Instead, to identify an entry, human users rely on the following pieces of information:
The lemma is given in standard orthographic representation. Sometimes, the lemma representation itself is used for additional purposes, e.g. to indicate stress, syllable boundaries or word-break points for hyphenation. However, the present comprehensive microstructure provides dedicated fields for such purposes.
If the lemma is a segmental sign, but not an independent word form, it is flanked by a hyphen (or two hyphens) at the side where it is bound, like English cran- and -ize. If it is a morphonological process or a suprasegmental sign, such as German metaphony marking plural or Yucatec Maya high tone marking deagentive formation, a suitable notation has to be developed. Such conventions are explained in the appropriate main section of the dictionary.
Homonyms are, of course, separate entries distinguished by numbers; see the separate section for details. The same goes for the readings of a polysemous entry; see next item.
As shown in the section on lexical relations, a relational lexical database provides an elegant way both of keeping the distinction between homonymy and polysemy flexible and of providing a partial microstructure of its own for each of the readings of a polysemous item. These readings are numbered from 0 (mother entry) to n.
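The numbering of homonyms and readings can be sketched as follows. This is a minimal illustration using Python's built-in `sqlite3` module; the table name `entry`, its column names, the label format and the sample lemma are assumptions made for the example, not part of the microstructure described here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE entry (
        id         INTEGER PRIMARY KEY,         -- technical ID, not meant for human users
        lemma      TEXT NOT NULL,
        homonym_no INTEGER NOT NULL DEFAULT 1,  -- distinguishes homonyms
        reading_no INTEGER NOT NULL DEFAULT 0,  -- 0 = mother entry, 1..n = readings
        UNIQUE (lemma, homonym_no, reading_no)
    )
""")

# Hypothetical sample data: two homonyms, the first with one additional reading.
conn.executemany(
    "INSERT INTO entry (lemma, homonym_no, reading_no) VALUES (?, ?, ?)",
    [("bank", 1, 0), ("bank", 1, 1), ("bank", 2, 0)],
)

def label(lemma, homonym_no, reading_no):
    """Build a human-readable identifier (the format is an assumption)."""
    return f"{lemma} {homonym_no}.{reading_no}"

def readings(lemma, homonym_no):
    """Return the reading numbers recorded for one homonym, in order."""
    cur = conn.execute(
        "SELECT reading_no FROM entry "
        "WHERE lemma = ? AND homonym_no = ? ORDER BY reading_no",
        (lemma, homonym_no),
    )
    return [row[0] for row in cur]
```

Because homonym and reading are both plain numbers on the record, moving a reading to a separate homonym (or the reverse) only requires renumbering; no record has to be restructured.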
Normally, the lemma should be given in the citation form that is traditional in the speech community. In that case, the content of this field is identical to the content of field #1. However, the dictionary also contains lemmas – especially dependent forms (roots, stems, affixes ...) – that would not be cited in their naked form by non-linguists. In such cases, the field ‘citation form’ must be filled in. For instance, Yucatec Maya verb roots are never cited as such, but with a default stem extension, for instance: lemma kul- (‘sit’), citation form: kutal. This field is also the reference point for field #6 below.
In every natural language that is at all represented in a dictionary, an expression has at least two significantia, a phonological and a graphemic one.
The phonological representation is given in IPA. If the citation form differs from the lemma, this field may refer to either of them. In practice, filling in this field may be limited to such cases where the phonological representation is not derivable by rule from the orthographic representation.
This is a link to a sound file. There are at least two possibilities here:
This field accounts for idiosyncratic phonological variation. For instance, the first phoneme of economic may be /ɛ/ or /i/.
Other variants are derivable by phonological rule. For instance, the rule of syncope in German predicts that if there is a lemma Wanderer, there will be a variant Wandrer. This rule is stated in the phonology section of the dictionary and then renders superfluous the enumeration of all the variants it generates.
The standard graphemic representation already serves as lemma. This field is needed for alternative spellings found in the corpus. For example, the lemma encyclopedia contains encyclopaedia in this field. The lesser the degree to which the language is standardized, the more variation there is in available texts (e.g. of earlier periods); and the fewer dictionaries there are for the language, the more important it becomes to display the variants in this field.
Again, such orthographic variants which are based on regular phonological variants do not need to be noted individually.
The language that constitutes the object of the description including the dictionary determines the corpus set up for the research. It must be defined in the introduction to the dictionary. The definition will necessarily, and explicitly, exclude certain kinds of linguistic variation (cf. the section on variation). For instance, diachronic variation may be limited to one of the stages traditionally defined for the language. All the variation not explicitly excluded will be represented in the corpus and has to be categorized. This information is also called diasystematic marking.
The possible contents of the following fields may be defined as range sets.
Assuming that the lexicon is not confined to one dialect, dialect appurtenance of the lemma is indicated here. One possible value of this field is ‘standard’.
Again, this will usually be marked only if the lexical item is special. Relevant values include particular age groups or professions.
This concerns style, register, connotations and any kind of pragmatic information. The same restrictions as in the previous fields apply. Relevant values include ‘ritual’, ‘formal’, ‘vulgar’. Cf. Wahrig 1973, ch. 4.
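A range set in this sense can be modelled as a closed enumeration against which field contents are validated. The following is a minimal sketch; the class name `Register`, the helper `parse_register` and the convention that an empty value means 'unmarked' are assumptions made for illustration, while the three values are the ones named above.

```python
from enum import Enum
from typing import Optional

class Register(Enum):
    """Hypothetical range set for the style/register field."""
    RITUAL = "ritual"
    FORMAL = "formal"
    VULGAR = "vulgar"

def parse_register(value: Optional[str]) -> Optional[Register]:
    """Validate a field value against the range set.

    An empty or missing value is read as 'unmarked' (an assumption of
    this sketch); any other value outside the range set raises ValueError.
    """
    if value is None or value.strip() == "":
        return None
    return Register(value.strip().lower())
```

The same pattern would apply to the dialect and sociolect fields, each with its own enumeration.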
A lexicon is usually confined to one stage of a language. Other stages may enter in two ways:
This section of the microstructure concerns both the internal structure of the stem representing the lexeme – its inflectional and derivational morphology – and its distribution in syntax and phraseology. The concepts used are those introduced systematically in the grammar coupled with the lexicon; and they will appear in print in the grammar section of the published dictionary. Whenever such a term appears in a lexical entry, reference to that grammar (section) is implied. Cf. Hausmann 1977, ch. 6, Wahrig 1973, ch. 2.
What is meant here is the proper name of a grammatical morpheme. For instance, the proper name of the English suffix -ize is ‘verbalizer’, and the proper name of English 's is ‘(Saxon) genitive’. Consequently, the possible contents of this field are unique (i.e. there is no range set), and only a portion of the entries of the lexical database will be specified for this field, viz. the grammatical formatives.
From among the grammatical categories of a lexical item, this field is dedicated to its syntactic category qua distributional category (for morphological categories see #18). This is understood as a narrow subcategory of a part of speech, e.g. ‘proper noun’, ‘transitive verb with additional prepositional complement’. The taxonomy implied here will be explained in the grammar.
A lemma which belongs to diverse syntactic categories is considered polysemous. Each category then constitutes a record.
This field contains the immediate constituents of the lemma stem; as long as binarism obtains, there are two of them. In the case of a compound, they are two stems; in the case of a derivative, they are a stem and some derivational operator which may or may not be segmental. The items listed there are identical to certain lemmas of the database.
database structure | implementation |
---|---|
relational | set up a cross-table for derivational relations: column 1: ID of complex stem; column 2: ID of first constituent; column 3: ID of second constituent |
free field structure | hyperlinks from the constituents to their target records |
This field contains the technical term for the word-formation process that formed the lemma stem, e.g. bahuvrihi, causative, denominal, deverbal, intensive etc. Possible entries in this field are taken from a range set defined in the grammar, where the word-formation processes of the language are dealt with systematically.
In this field, the last word formation process applied is indicated, i.e. the process which was applied to the components of field #15 to form the stem of the lemma. In the case of a derivationally complex lemma, other word formation processes may have created stems that are part of it, in particular those of field #15. Such processes are not indicated here, since they may be seen by following the links of the latter field.
In this field, the set of lemmas is referenced which have the current lemma in their field #15. Thus, references between the current field and that field are mutual.
database structure | implementation |
---|---|
relational | the cross-table already mentioned for field 15 provides the content both for that field and for the present field |
free field structure | series of hyperlinks (converse to the ones of #15) to each of the derivative lemmas |
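The relational variant can be sketched with a single cross-table that serves both field #15 (constituents) and the present field (derivatives). The sketch uses Python's built-in `sqlite3` module; the table and column names and the two helper functions are assumptions made for illustration, and the sample data are the Yucatec Maya items kim, -s and kims cited elsewhere in this text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lemma (id INTEGER PRIMARY KEY, form TEXT NOT NULL);
-- Cross-table: column 1 = complex stem, columns 2 and 3 = its constituents.
CREATE TABLE derivation (
    complex_id INTEGER NOT NULL REFERENCES lemma(id),
    first_id   INTEGER NOT NULL REFERENCES lemma(id),
    second_id  INTEGER NOT NULL REFERENCES lemma(id)
);
INSERT INTO lemma (id, form) VALUES (1, 'kim'), (2, '-s'), (3, 'kims');
INSERT INTO derivation VALUES (3, 1, 2);
""")

def constituents(form):
    """Field #15: the immediate constituents of a complex stem."""
    row = conn.execute(
        """SELECT a.form, b.form
           FROM derivation d
           JOIN lemma c ON c.id = d.complex_id
           JOIN lemma a ON a.id = d.first_id
           JOIN lemma b ON b.id = d.second_id
           WHERE c.form = ?""",
        (form,),
    ).fetchone()
    return list(row) if row else []

def derivatives(form):
    """Converse field: all lemmas that have this lemma as a constituent."""
    cur = conn.execute(
        """SELECT c.form
           FROM derivation d
           JOIN lemma c ON c.id = d.complex_id
           JOIN lemma x ON x.id IN (d.first_id, d.second_id)
           WHERE x.form = ?
           ORDER BY c.form""",
        (form,),
    )
    return [row[0] for row in cur]
```

Since both fields are read from the same table, the mutual references cannot get out of step, which is the main advantage over maintaining two sets of hyperlinks by hand.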
An alternative would be not to have entries for productively derived stems in the dictionary. Then the present field would contain the derivation schemata (of #16) which are applicable to the lemma stem.
The nature of the inflectional categories to be specified here depends on the language. Examples are noun class, gender, possessive class, verbal voice, inflection class. An inflecting word of a language may fall into diverse morphological categories at once, e.g. voice x, conjugation class y. Some may be syntactically relevant lexical classes such as the gender of a noun, others may be purely morphological classes such as inflection classes. It is practical to set up a separate field for each of these categories.
If the stem has inflected forms not derivable by rules pertaining to its inflection class, those forms are listed here. They may be both stem allomorphs such as worse, appearing in this field of the lemma bad, and irregular forms of the inflection paradigm, e.g. oxen appearing in this field of the lemma ox.
This field contains the syntactic and semantic construction frame (for a verb: its valency frame), including selection restrictions. A case in point is the different complement constructions of complement-taking verbs. This is a specification of the information contained in #14. It should be represented by a formal notation, e.g. [ ~ X ]Y, where ~ indicates the position of the lemma, X represents relevant syntactic constituents or properties of the context, and Y is the syntactic category of the construction.
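A notation of the form [ ~ X ]Y lends itself to mechanical checking. The following is a minimal sketch of a parser for it; the names `Frame` and `parse_frame`, the whitespace-separated token format, and the sample frame `[ ~ NP ]VP` are assumptions made for illustration and are not prescribed by the notation itself.

```python
import re
from typing import List, NamedTuple

class Frame(NamedTuple):
    left: List[str]    # context constituents preceding the lemma
    right: List[str]   # context constituents following the lemma
    category: str      # syntactic category Y of the whole construction

# "[", bracket content, "]", then the category label.
FRAME_RE = re.compile(r"^\[\s*(?P<body>[^\]]*?)\s*\]\s*(?P<cat>\S+)$")

def parse_frame(notation: str) -> Frame:
    """Split a frame such as '[ ~ NP ]VP' into context and category."""
    match = FRAME_RE.match(notation.strip())
    if match is None:
        raise ValueError(f"not a construction frame: {notation!r}")
    tokens = match.group("body").split()
    if tokens.count("~") != 1:
        raise ValueError("frame must contain exactly one lemma position (~)")
    i = tokens.index("~")
    return Frame(tokens[:i], tokens[i + 1:], match.group("cat"))
```

For instance, `parse_frame("[ ~ NP ]VP")` yields an empty left context, the right context `["NP"]` and the category `"VP"`.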
This field lists collocations in which the lemma is involved. These may be any kind of fixed expressions, including phrases, idioms and proverbs (cf. Bergenholtz & Tarp, ch. 7.2). If one has decided to bestow lemma status on such complex expressions, then this field contains links to such lemmas.
Semantic information on the lemma is provided in different languages and from different points of view. The sum of the information contained in this subset of fields is highly redundant. It is, however, useful for different kinds of dictionaries to be output from the database.
The basis of the methodology of dictionary definitions is the logic and methodology of the definition in general. See the website devoted to this topic. For the present purpose, the meaning is specified in plain prose. What is easily formalizable about it is relegated to other fields of the lexical entry, in particular 24 and 25.
In specifying properties of an argument of a relational lexeme, care must be taken to distinguish selection restrictions for a possible argument from semantic features of the lexeme itself. For instance:
In the German example a), ‘[clothes]’ represents a selection restriction concerning the direct object of a transitive verb. In example b) instead, ‘clothes’ represents part of the meaning of an intransitive verb. If the brackets were missing in a), it would not be clear that ‘clothes’ does represent a selection restriction on a direct object.
This field contains a specification of the meaning of the lemma in plain prose of the same language, as it would be the case in a monolingual dictionary.
The following fields specify the meaning in a set of languages that may be relevant in the working context. These may include the following languages:
To the second of these fields may be added another one, called ‘native translation’, which contains literally the explanation that the informant gave. This differs from field #22.2, which generally contains the lexicographer's definition.
Since a polysemous lexeme is split up over as many database records as it has readings (see the relevant discussion), each lemma has only one sense or meaning. There is, thus, no necessity to provide for a special substructure of these fields.
Whenever a token of the lemma stem appears in a text provided by an interlinear morphological gloss, the gloss is the same for all tokens of that type. This is achieved by retrieving it from the lexicon.[1] In principle, the gloss is provided in the same user languages as before. However, since only linguists are interested in interlinear glosses, an English gloss may suffice.
Care must be taken concerning the relation of the gloss to the lexical item. An item of an interlinear gloss corresponds to an item in the text, and that text item is taken as a whole, i.e. not analyzed morphologically. That is, minimally, a morph. Whenever the lemma of a lexical entry consists of a morpheme, no problem arises for the gloss field. The same holds, in principle, if the lemma is a complex stem (which is not analyzed in the lemma field itself; see below). Problems do arise if the lemma is an inflected (citation) form, because then it will contain inflectional morphemes beside the stem. The gloss, however, must render the stem.
field | content |
---|---|
lemma | werden |
stem | werd |
gloss | become |
Moreover, the morphological gloss does not need to have a morphological substructure. This is important for morphologically complex stems. For instance, the Yucatec Maya lexicon contains the following three items:
lemma | gloss |
---|---|
kim | die |
-s | -CAUS |
kims | kill |
The granularity of the glosses in a text is adjusted to the granularity of the morphological analysis executed there. That means:
Thus, the granularity of the gloss is not decided separately, but is a consequence of the granularity of the morphological analysis applied to the text. This, in turn, depends on the specific purpose pursued with the gloss (see Lehmann 2004, section 3.7). It is therefore inappropriate to provide a gloss ‘die:CAUS’ for the lexical entry kims.
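The dependence of the gloss on the segmentation can be sketched in a few lines. The lexicon below contains exactly the three Yucatec Maya items listed above; the function name `gloss` and the convention of passing the morphological analysis as a list of lemma keys are assumptions made for illustration.

```python
# Gloss field of the three lexical entries; each gloss is stored unanalyzed.
LEXICON = {
    "kim": "die",
    "-s": "-CAUS",
    "kims": "kill",
}

def gloss(segments):
    """Gloss a word form according to the segmentation chosen for the text.

    Each segment is looked up as a whole; affix glosses carry their own
    hyphen, so the parts can simply be concatenated.
    """
    return "".join(LEXICON[segment] for segment in segments)

unanalyzed = gloss(["kims"])       # the stem is treated as one item
analyzed = gloss(["kim", "-s"])    # the stem is segmented in the text
```

Here `unanalyzed` is `"kill"` and `analyzed` is `"die-CAUS"`: the complex gloss arises from the analysis of the text, not from the gloss field of the entry kims.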
Each lexical item – at least those with a lexical meaning – belongs to one or more semantic classes. For instance, leg is a body part, spider is an insect, laugh is an expression of emotion. This classificatory information is implicit or even explicit in a good definition (field #22). However, it is highly useful to specify it in a separate field:
The semantic classes form a range set. A proposal for a practical set of semantic classes applicable to many languages is on a separate page.
Even monosemous lemmas may belong to more than one semantic class. For instance, apple is both a plant part and a food (and so are all the fruits).
database structure | implementation |
---|---|
relational | a table of semantic categories and a cross-table connecting lemmas with such categories |
free field structure | optionally more than one instance of this field per record |
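The relational variant amounts to a many-to-many relation between lemmas and semantic classes. A minimal sketch with Python's built-in `sqlite3` module follows; all table, column and function names are assumptions made for illustration, and the sample data are the examples apple and leg given above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lemma (id INTEGER PRIMARY KEY, form TEXT NOT NULL UNIQUE);
-- The range set of semantic classes.
CREATE TABLE semantic_class (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
-- Cross-table: one row per (lemma, class) pair.
CREATE TABLE lemma_class (
    lemma_id INTEGER NOT NULL REFERENCES lemma(id),
    class_id INTEGER NOT NULL REFERENCES semantic_class(id),
    PRIMARY KEY (lemma_id, class_id)
);
INSERT INTO lemma (id, form) VALUES (1, 'apple'), (2, 'leg');
INSERT INTO semantic_class (id, name)
    VALUES (1, 'plant part'), (2, 'food'), (3, 'body part');
INSERT INTO lemma_class VALUES (1, 1), (1, 2), (2, 3);
""")

def classes_of(form):
    """All semantic classes of a lemma, alphabetically."""
    cur = conn.execute(
        """SELECT s.name FROM lemma_class lc
           JOIN lemma l ON l.id = lc.lemma_id
           JOIN semantic_class s ON s.id = lc.class_id
           WHERE l.form = ? ORDER BY s.name""",
        (form,),
    )
    return [row[0] for row in cur]

def members_of(name):
    """All lemmas belonging to a semantic class, alphabetically."""
    cur = conn.execute(
        """SELECT l.form FROM lemma_class lc
           JOIN lemma l ON l.id = lc.lemma_id
           JOIN semantic_class s ON s.id = lc.class_id
           WHERE s.name = ? ORDER BY l.form""",
        (name,),
    )
    return [row[0] for row in cur]
```

The second query is what makes the separate field worthwhile: it retrieves, for example, all body part terms without parsing any definitions.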
This field contains paradigmatic lexical relations to other lemmas which have the current lemma in the corresponding field. These relations are, therefore, mutual. The technical implementation is as above for the derivational relations (field #17).
Relevant relations include:
See the detailed treatment of semantic relations in lexicography.
The content of this field goes beyond linguistic semantics, giving information on real-world, especially culture-specific properties of concepts designated. This field may refer to the pertinent section of the 'Situation of the language', esp. the ethnographic situation, for background information.
The field contains a link to an illustrative image file which may be shown automatically or upon mouseclick. Naturally, this will be relevant in connection with the previous field.
The following set of fields contains information on historical and genetic relations of the lemma.
This field is relevant for loans. Its content is the name of a language. This should be taken from a range set.
This field contains information on the etymology of the lemma, like an etymological dictionary. Naturally, this is subservient to information on word formation, which is contained in field #15.
This field contains formally or semantically related words from genetically related languages. Apart from their intrinsic interest, they are often methodologically useful since they may help identify the basic meaning of a lexeme.
An example in a dictionary entry illustrates a specific sense or construction. This local relevance of the example contributes to the reasons for apportioning a polysemous or syntactically heterogeneous item to several records.
An example in a dictionary entry has a double functionality:
For both purposes, it is convenient to select typical examples. A typical example:
In their function #1, examples must be drawn from the corpus. However, the corpus does not always contain suitable (typical) examples. Examples that only have function #2 may be concocted by the lexicographer. The safest way of doing this is by simplifying a corpus example.
The structure of a dictionary example is as follows:
some complex expression containing some form of the lemma, that form being represented by a tilde (~)
translation into a background language
(source of the example in the corpus)
In a monolingual dictionary, the translation is missing.
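The three-part structure can be produced mechanically from a corpus record. The following is a minimal sketch; the function name `format_example`, the quotation and bracketing conventions of the output, and the sample sentence with its corpus reference are assumptions made for illustration.

```python
import re
from typing import Optional

def format_example(expression: str, lemma_form: str,
                   translation: Optional[str] = None,
                   source: Optional[str] = None) -> str:
    """Render a dictionary example.

    The first occurrence of the given form of the lemma is replaced by a
    tilde; translation and source are appended only if present, so that a
    monolingual dictionary can simply omit the translation.
    """
    pattern = r"\b" + re.escape(lemma_form) + r"\b"
    text, count = re.subn(pattern, "~", expression, count=1)
    if count == 0:
        raise ValueError(f"{lemma_form!r} does not occur in the example")
    parts = [text]
    if translation is not None:
        parts.append(f"'{translation}'")
    if source is not None:
        parts.append(f"({source})")
    return " ".join(parts)

# Hypothetical example for the lemma werden; 'T01.12' is an invented corpus ID.
bilingual = format_example("Es wird kalt", "wird", "It is getting cold", "T01.12")
monolingual = format_example("Es wird kalt", "wird")
```

Raising an error when the form is not found guards against examples that were attached to the wrong record.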
Each lexical entry (record) has a set of examples. The examples are not given literally, but represented by record IDs of the text corpus.
database structure | implementation |
---|---|
relational | cross-table linking lemma IDs with IDs of records of the corpus |
free field structure | set of instances of the field ‘examples’ per record, each containing a hyperlink to a record of the corpus |
Cf. Sinclair (ed.) 1987, ch. 7.
The set of fields following here contains information relevant for the researcher in working on the database. Most of it is not destined to be published.
Information on lexical properties, including whole lexical items, may come from published sources. In principle, it may relate to any of the fields of the entry in particular. For that reason, it might seem appropriate to accompany each field of a record with its own bibliographical reference field. However, that would be overdoing it. It suffices to have one field for such references, which contains free text specifying which information comes from which source. The source itself is indicated by the conventional bibliographical short form (like ‘Hale 1985’), implying reference to a bibliographical database that resolves such short forms.
Such lexical information can generally only be published after counterchecking it with one's text corpus and/or informants. After that, the bibliographical reference no longer needs to appear in the published dictionary entry and may pass to the general bibliography.
This field contains any additional information, esp. of a methodological, stylistic, sociolinguistic nature, including the status of the lemma and ungrammatical examples. In contrast to the following field, its content could, in principle, be published (although it seldom will be).
This field contains questions to be investigated and problems to be solved in future lexicographic work, especially fieldwork. This field is directly related to the previous one: a problem is formulated in the present field. Once it is solved, its solution is noted in the field ‘comment’ (so that it may not be forgotten), and the problem is deleted. Thus, the content of this field is destined exclusively for the researcher and never published.
This is the date of last modification, which the DBMS will update automatically.
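How the automatic update is achieved depends on the DBMS. As one possible sketch, SQLite (accessed here through Python's built-in `sqlite3` module) can do it with a trigger; the table `entry`, its columns, the trigger name and the sample record are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entry (
    id       INTEGER PRIMARY KEY,
    lemma    TEXT NOT NULL,
    comment  TEXT,
    modified TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
-- Fires only when a content column changes, so that the trigger's own
-- update of 'modified' does not fire it again.
CREATE TRIGGER entry_touch
AFTER UPDATE OF lemma, comment ON entry
BEGIN
    UPDATE entry SET modified = CURRENT_TIMESTAMP WHERE id = NEW.id;
END;
""")

# Insert a record with an artificially old date, then edit it.
conn.execute(
    "INSERT INTO entry (id, lemma, modified) "
    "VALUES (1, 'werden', '2000-01-01 00:00:00')"
)
conn.execute("UPDATE entry SET comment = 'checked with informant' WHERE id = 1")

modified = conn.execute(
    "SELECT modified FROM entry WHERE id = 1"
).fetchone()[0]
```

After the update, `modified` no longer holds the old date but the time of the edit, without the researcher having touched the field.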
Only a dictionary that boils down to a word list has no microstructure (see the retrograde dictionary as an example). Traditionally, the minimum microstructure for a general dictionary is:
lemma – definition – examples.
On the other hand, additional kinds of information not mentioned above are easily imaginable: paronomasias, concept history etc.
If there is a team of lexicographers, it may be necessary to add a field ‘editor’ to the methodology section, which holds the initials of the researchers who touched the entry, in chronological order.
The microstructures of a monolingual dictionary and of the L2–L1 volume of a bilingual general dictionary differ only in a few fields:
[1] Toolbox comes with an automatic interlinear glossing feature which works rather economically for agglutinative morphology.