Microstructure: structure of a lexical entry

The microstructure of a dictionary – more precisely, of a word list – is the internal structure of any of its entries.

The outline of the microstructure of a dictionary presented below obeys the following principles:

It is oriented towards the description of a language against the background of another language. We are thus, in principle, speaking of a bilingual dictionary, although there may be more than one relevant background language in the working context.
It attends to the needs of a fieldworker who describes a language he does not master himself.
It provides a maximum model which will not be fully needed in every single case. It is, however, easier to leave cells provided by the structure blank than to expand the structure after one has started with too simple a model. Likewise, it is simpler to conflate the contents of two fields into one afterwards than to split up the content of one field into two.
The microstructure is explained with a (relational) lexical database in mind.

The complete set of fields, with their numbers, names, explanation, examples and linking relations to other parts of the overall linguistic description, is listed in a table on a separate page.

A. Entry identity

Each entry of a dictionary must be identifiable, both by references from within the dictionary and by references from outside (e.g. the accompanying grammar). If the lexical database is a relational database, then each entry will have an ID in the technical sense. However, since that is a user-unfriendly number, it is normally not suitable for human users. Instead, to identify an entry, human users rely on the following pieces of information:

Lemma

The lemma is given in standard orthographic representation. Sometimes, the lemma representation itself is used for additional purposes, e.g. to indicate stress, syllable boundaries or word-break points for hyphenation. However, the present comprehensive microstructure provides dedicated fields for such purposes.

If the lemma is a segmental sign, but not an independent word form, it is flanked by a hyphen (or two hyphens) at the side where it is bound, like English cran- and -ize. If it is a morphonological process or a suprasegmental sign, such as German metaphony marking plural or Yucatec Maya high tone marking deagentive formation, a suitable notation has to be developed. Such conventions are explained in the appropriate main section of the dictionary.

Homonym number

Homonyms are, of course, separate entries distinguished by numbers; see the separate section for details. The same goes for the readings of a polysemous entry; see next item.

Sense number

As shown in the section on lexical relations, a relational lexical database provides an elegant way both of keeping the distinction between homonymy and polysemy flexible and of providing a partial microstructure of its own for each of the readings of a polysemous item. These readings are numbered from 0 (mother entry) to n.

Citation form

Normally, the lemma should be given in the citation form that is traditional in the speech community. In that case, the content of this field is identical to the content of field #1. However, the dictionary also contains lemmas – especially dependent forms (roots, stems, affixes ...) – that would not be cited in their naked form by non-linguists. In such cases, the field ‘citation form’ must be filled in. For instance, Yucatec Maya verb roots are never cited as such, but with a default stem extension, for instance: lemma kul- (‘sit’), citation form: kutal. This field is also the reference point for field #6 below.
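
The identifying fields of this section can be pictured as a small record type. The following is only a sketch under assumptions: the class name `Entry`, its attribute names, the record ID 1017 and the default homonym number 1 are invented for illustration; only the lemma kul- and its citation form kutal come from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    """Identity fields of one record; all names here are illustrative."""
    record_id: int                        # technical ID, not shown to users
    lemma: str                            # field 1
    homonym_no: int = 1                   # field 2 (default is an assumption)
    sense_no: int = 0                     # field 3; 0 = mother entry
    citation_form: Optional[str] = None   # field 4; None = same as lemma

    def cited_as(self) -> str:
        """Return the citation form, falling back to the lemma."""
        return self.citation_form if self.citation_form is not None else self.lemma

    def label(self) -> str:
        """Human-readable identifier built from lemma, homonym and sense number."""
        return f"{self.lemma} {self.homonym_no}.{self.sense_no}"


kul = Entry(record_id=1017, lemma="kul-", citation_form="kutal")
print(kul.label())     # kul- 1.0
print(kul.cited_as())  # kutal
```

Leaving `citation_form` empty when it coincides with the lemma is one possible design; storing the identical string in both fields, as the text describes, would work just as well.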

B. Expression

In every natural language that is at all represented in a dictionary, an expression has at least two significantia, a phonological and a graphemic one.

Phonological representation

The phonological representation is given in IPA. If the citation form differs from the lemma, this field may refer to either of them. In practice, filling in this field may be limited to such cases where the phonological representation is not derivable by rule from the orthographic representation.

Sound

This is a link to a sound file. There are at least two possibilities here:

There is a separate sound file for each lemma. This presupposes that there is a representation – a citation form – of the lemma that may be pronounced naturally in isolation. Consequently, what is pronounced here is the content of field #4.
If there is no such sound file for each lemma, this field may contain one or more pointers to text recordings. Technically, such a pointer identifies a sound file and specifies a start point and an end point contained in that file, in milliseconds. It will typically be a sound file associated with one of the examples (see #31 below).
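
A pointer of the second kind could be modelled roughly as follows. This is a sketch, not a prescribed format: the class name `SoundPointer`, its attribute names and the file name `text_03.wav` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SoundPointer:
    """A stretch of a recording; field names are illustrative."""
    sound_file: str   # path or URL of the recording
    start_ms: int     # start point in milliseconds
    end_ms: int       # end point in milliseconds

    def __post_init__(self) -> None:
        # Reject pointers whose end does not lie after their start.
        if not 0 <= self.start_ms < self.end_ms:
            raise ValueError("need 0 <= start_ms < end_ms")

    def duration_ms(self) -> int:
        """Length of the referenced stretch in milliseconds."""
        return self.end_ms - self.start_ms


pointer = SoundPointer("text_03.wav", start_ms=12_500, end_ms=13_140)
print(pointer.duration_ms())  # 640
```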

Phonological variants

This field accounts for idiosyncratic phonological variation. For instance, the first phoneme of economic may be /ɛ/ or /i/.

Other variants are derivable by phonological rule. For instance, the rule of syncope in German predicts that if there is a lemma Wanderer, there will be a variant Wandrer. This rule is stated in the phonology section of the dictionary and then renders superfluous the enumeration of all the variants it generates.

Orthographic variants

The standard graphemic representation already serves as lemma. This field is needed for alternative spellings found in the corpus. For example, the lemma encyclopedia contains encyclopaedia in this field. The less the language is standardized, the more variation there is in available texts (e.g. of earlier periods); and the fewer dictionaries there are for the language, the more important it becomes to display the variants in this field.

Again, such orthographic variants which are based on regular phonological variants do not need to be noted individually.

C. Language variety

The language that constitutes the object of the description including the dictionary determines the corpus set up for the research. It must be defined in the introduction to the dictionary. The definition will necessarily, and explicitly, exclude certain kinds of linguistic variation (cf. the section on variation). For instance, diachronic variation may be limited to one of the stages traditionally defined for the language. All the variation not explicitly excluded will be represented in the corpus and has to be categorized. This information is also called diasystematic marking.

The possible contents of the following fields may be defined as range sets.

Dialect

Assuming that the lexicon is not confined to one dialect, dialect appurtenance of the lemma is indicated here. One possible value of this field is ‘standard’.

Sociolect

Again, this will usually be marked only if the lexical item is special. Relevant values include particular age groups or professions.

Style

This concerns style, register, connotations and any kind of pragmatic information. The same restrictions as in the previous fields apply. Relevant values include ‘ritual’, ‘formal’, ‘vulgar’. Cf. Wahrig 1973, ch. 4.

Stage

A lexicon is usually confined to one stage of a language. Other stages may come into play in two ways:

If the text corpus includes older texts, it may feature obsolete items.
Because of the presence of diachrony in synchrony, some elements of the inventory of a given state are archaic, others are current, others are fashionable.
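
A range set can be approximated by an enumeration against which field values are checked. The sketch below uses the style and stage values mentioned in this section; the class names, the helper `check` and the decision to let blank fields pass are assumptions made for illustration.

```python
from enum import Enum


class Style(Enum):
    RITUAL = "ritual"
    FORMAL = "formal"
    VULGAR = "vulgar"


class Stage(Enum):
    OBSOLETE = "obsolete"
    ARCHAIC = "archaic"
    CURRENT = "current"
    FASHIONABLE = "fashionable"


def check(range_set, value):
    """Return the member for `value`, or None if the field is left blank."""
    if value is None or value == "":
        return None
    return range_set(value)   # raises ValueError for values outside the set


print(check(Style, "ritual").name)  # RITUAL
print(check(Stage, ""))             # None
```

A misspelt or unforeseen value such as `check(Stage, "old")` raises an error instead of silently entering the database, which is the point of defining the set beforehand.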

D. Structure

This section of the microstructure concerns both the internal structure of the stem representing the lexeme – its inflectional and derivational morphology – and its distribution in syntax and phraseology. The concepts used are those introduced systematically in the grammar coupled with the lexicon; and they will appear in print in the grammar section of the published dictionary. Whenever such a term appears in a lexical entry, reference to that grammar (section) is implied. Cf. Hausmann 1977, ch. 6, Wahrig 1973, ch. 2.

Proper name

What is meant here is the proper name of a grammatical morpheme. For instance, the proper name of the English suffix -ize is ‘verbalizer’, and the proper name of English 's is ‘(Saxon) genitive’. Consequently, the possible contents of this field are unique (i.e. there is no range set), and only a portion of the entries of the lexical database will be specified for this field, viz. the grammatical formatives.

Syntactic category

From among the grammatical categories of a lexical item, this field is dedicated to its syntactic category qua distributional category (for morphological categories see #18). This is understood as a narrow subcategory of a part of speech, e.g. ‘proper noun’, ‘transitive verb with additional prepositional complement’. The taxonomy implied here will be explained in the grammar.

A lemma which belongs to diverse syntactic categories is considered polysemous. Each category then constitutes a record.

Morphological structure

This field contains the immediate constituents of the lemma stem; as long as binarism obtains, there are two of them. In the case of a compound, they are two stems; in the case of a derivative, they are a stem and some derivational operator which may or may not be segmental. The items listed here are identical to certain lemmas of the database.

Database solution
relational	set up a cross-table for derivational relations: column 1: ID of complex stem; column 2: ID of first constituent; column 3: ID of second constituent
free field structure	hyperlinks from the constituents to their target records
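
The relational solution might look roughly like this in SQLite. The table and column names are invented for the sketch; the three Yucatec Maya items kim, -s and kims are those discussed under the gloss field. Looking entries up by form alone ignores homonym and sense numbers, which a real database would have to include.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE lemma (id INTEGER PRIMARY KEY, form TEXT NOT NULL);
    CREATE TABLE morph_structure (
        complex_id INTEGER PRIMARY KEY REFERENCES lemma(id),  -- column 1
        first_id   INTEGER NOT NULL REFERENCES lemma(id),     -- column 2
        second_id  INTEGER NOT NULL REFERENCES lemma(id)      -- column 3
    );
    INSERT INTO lemma VALUES (1, 'kim'), (2, '-s'), (3, 'kims');
    INSERT INTO morph_structure VALUES (3, 1, 2);
""")


def constituents(complex_form):
    """Return the forms of the two immediate constituents, or None."""
    return con.execute(
        """SELECT a.form, b.form
           FROM morph_structure m
           JOIN lemma c ON c.id = m.complex_id
           JOIN lemma a ON a.id = m.first_id
           JOIN lemma b ON b.id = m.second_id
           WHERE c.form = ?""",
        (complex_form,),
    ).fetchone()


print(constituents("kims"))  # ('kim', '-s')
print(constituents("kim"))   # None
```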

Word formation

This field contains the technical term for the word-formation process that formed the lemma stem, e.g. bahuvrihi, causative, denominal, deverbal, intensive etc. Possible entries in this field are taken from a range set defined in the grammar, where the word-formation processes of the language are dealt with systematically.

In this field, the last word formation process applied is indicated, i.e. the process which was applied to the components of field #15 to form the stem of the lemma. In the case of a derivationally complex lemma, other word formation processes may have created stems that are part of it, in particular those of field #15. Such processes are not indicated here, since they may be seen by following the links of the latter field.

Derivatives

In this field, the set of lemmas is referenced which have the current lemma in their field #15. Thus, references between the current field and that field are mutual.

Database solution
relational	the cross-table already mentioned for field 15 provides the content both for that field and for the present field
free field structure	series of hyperlinks (converse to the ones of #15) to each of the derivative lemmas
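
That one cross-table serves both directions can be shown with a converse query. In this sketch the table layout and names are assumptions, and the English lemmas modern, real, modernize and realize are illustrative additions; only the suffix -ize is taken from the text.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE lemma (id INTEGER PRIMARY KEY, form TEXT NOT NULL);
    CREATE TABLE morph_structure (
        complex_id INTEGER, first_id INTEGER, second_id INTEGER
    );
    INSERT INTO lemma VALUES (1, 'modern'), (2, '-ize'), (3, 'modernize'),
                             (4, 'real'), (5, 'realize');
    INSERT INTO morph_structure VALUES (3, 1, 2), (5, 4, 2);
""")


def derivatives(lemma_id):
    """Forms of all complex stems that have `lemma_id` as a constituent."""
    rows = con.execute(
        """SELECT c.form
           FROM morph_structure m
           JOIN lemma c ON c.id = m.complex_id
           WHERE m.first_id = ? OR m.second_id = ?
           ORDER BY c.form""",
        (lemma_id, lemma_id),
    ).fetchall()
    return [form for (form,) in rows]


print(derivatives(2))  # ['modernize', 'realize']
print(derivatives(1))  # ['modernize']
```

Because the derivatives are computed rather than stored, the two fields cannot drift out of step with each other.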

An alternative would be not to have entries for productively derived stems in the dictionary. Then the present field would contain the derivation schemata (of #16) which are applicable to the lemma stem.

Morphological categories

The nature of the inflectional categories to be specified here depends on the language. Examples are noun class, gender, possessive class, verbal voice, inflection class. An inflecting word of a language may fall into diverse morphological categories at once, e.g. voice x, conjugation class y. Some may be syntactically relevant lexical classes such as the gender of a noun, others may be purely morphological classes such as inflection classes. It is practical to set up a separate field for each of these categories.

Irregular inflection

If the stem has inflected forms not derivable by rules pertaining to its inflection class, those forms are listed here. They may be both stem allomorphs such as worse, appearing in this field of the lemma bad, and irregular forms of the inflection paradigm, e.g. oxen appearing in this field of the lemma ox.

Construction

This field contains the syntactic and semantic construction frame (for a verb: its valency frame), including selection restrictions. Cases in point are the different complement constructions of complement-taking verbs. This is a specification of the information contained in #14. It should be represented by a formal notation, e.g. [ ~ X ]_Y, where ~ indicates the position of the lemma, X represents relevant syntactic constituents or properties of the context, and Y is the syntactic category of the construction.
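
The bracket notation can be stored in structured form and rendered on demand. The following sketch assumes a class `Frame` with invented attribute names; the category labels NP and VP and the lemma "put on" are merely illustrative, since the actual taxonomy is defined in the grammar.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    """A construction frame [ ~ X ]_Y; attribute names are illustrative."""
    context: str               # X: constituents or properties of the context
    category: str              # Y: syntactic category of the construction
    lemma_first: bool = True   # whether the lemma precedes the context

    def render(self) -> str:
        """Return the frame in bracket notation with ~ for the lemma."""
        inner = f"~ {self.context}" if self.lemma_first else f"{self.context} ~"
        return f"[ {inner} ]_{self.category}"

    def instantiate(self, lemma: str) -> str:
        """Replace the tilde by a concrete lemma."""
        return self.render().replace("~", lemma)


frame = Frame(context="NP", category="VP")
print(frame.render())               # [ ~ NP ]_VP
print(frame.instantiate("put on"))  # [ put on NP ]_VP
```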

Phraseology

This field lists collocations in which the lemma is involved. These may be any kind of fixed expressions, including phrases, idioms and proverbs (cf. Bergenholtz & Tarp, ch. 7.2). If one has decided to bestow lemma status on such complex expressions, then this field contains links to such lemmas.

E. Meaning

Meaning definitions

Semantic information on the lemma is provided in different languages and from different points of view. The sum of the information contained in this subset of fields is highly redundant. It is, however, useful for different kinds of dictionaries to be output from the database.

The basis of the methodology of dictionary definitions is the logic and methodology of the definition in general. See the website devoted to this topic. For the present purpose, the meaning is specified in plain prose. What is easily formalizable about it is relegated to other fields of the lexical entry, in particular 24 and 25.

In specifying properties of an argument of a relational lexeme, care must be taken to distinguish selection restrictions for a possible argument from semantic features of the lexeme itself. For instance:

a) anziehen (tr.vb.) put on [clothes]
b) sich anziehen (intr.vb.) put on clothes

In the German example a), ‘[clothes]’ represents a selection restriction concerning the direct object of a transitive verb. In example b), by contrast, ‘clothes’ represents part of the meaning of an intransitive verb. If the brackets were missing in a), it would not be clear that ‘clothes’ represents a selection restriction on a direct object.

Native definition

This field contains a specification of the meaning of the lemma in plain prose of the same language, as would be the case in a monolingual dictionary.

User language definitions

The following fields specify the meaning in a set of languages that may be relevant in the working context. These may include the following languages:

English, because that is the general lingua franca in which either the entire dictionary or, at least, extracts from it may be published;
the regional lingua franca, i.e. the language in which the linguist begins his fieldwork and in which the dictionary may be published, too;
the native language of the linguist, because that is the language he fully controls.

To the second of these fields may be added another one, called ‘native translation’, which contains literally the explanation that the informant gave. This differs from field #22.2, which generally contains the lexicographer's definition.

Since polysemous lexemes are split up over as many database records as they have readings (see the relevant discussion), each lemma has only one sense or meaning. There is, thus, no need to provide a special substructure for these fields.

Gloss

Whenever a token of the lemma stem appears in a text provided with an interlinear morphological gloss, the gloss is the same for all tokens of that type. This is achieved by retrieving it from the lexicon.¹ In principle, the gloss is provided in the same user languages as before. However, since only linguists are interested in interlinear glosses, an English gloss may suffice.

Care must be taken concerning the relation of the gloss to the lexical item. An item of an interlinear gloss corresponds to an item in the text that is taken as a whole, i.e. not analyzed morphologically. That is, minimally, a morph. Whenever the lemma of a lexical entry consists of a morpheme, no problem arises for the gloss field. The same holds, in principle, if the lemma is a complex stem (which is not analyzed in the lemma field itself; see below). Problems do arise if the lemma is an inflected (citation) form, because then it will contain inflectional morphemes beside the stem. The gloss, however, must render the stem.

If the lexicon makes the distinction between lemma (field 1) and citation form (#4), this problem does not arise, since lemmata then are stems, and citation forms are taken care of in field 4.
If, for reasons of traditional lexicographic conventions for that language, there are inflected lemmata, then there should be an additional field (‘stem’ or ‘base’) which represents that morphological entity that the gloss corresponds to. To illustrate from a German lexicon:

lemma	werden
stem	werd
gloss	become

Moreover, the morphological gloss does not need to have a morphological substructure. This is important for morphologically complex stems. For instance, the Yucatec Maya lexicon contains the following three items:

lemma	gloss
kim	die
-s	-CAUS
kims	kill

The granularity of the glosses in a text is adjusted to the granularity of the morphological analysis executed there. That means:

If the form of the verb kims occurring in the text is parsed as follows: kim-s, then a gloss will be retrieved from the lexicon for each of the morphs shown, i.e. the gloss will be ‘die-CAUS’.
If the form kims is not parsed in the text, it will be looked up as a whole in the lexicon, and its gloss will be ‘kill’.

Thus, the granularity of the gloss is not decided separately, but is a consequence of the granularity of the morphological analysis applied to the text. This, in turn, depends on the specific purpose pursued with the gloss (see Lehmann 2004, section 3.7). It is therefore inappropriate to provide a gloss ‘die:CAUS’ for the lexical entry kims.
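
The dependence of the gloss on the parse can be made concrete with a small lookup routine. This is a sketch under simplifying assumptions: morph boundaries are marked by a hyphen in the parsed form, every non-initial morph is a suffix entered in the lexicon with a leading hyphen, and the names `GLOSSES` and `gloss` are invented here. The three lexicon items are those of the table above.

```python
# Lexicon extract: lemma -> gloss, as in the Yucatec Maya table above.
GLOSSES = {"kim": "die", "-s": "-CAUS", "kims": "kill"}


def gloss(parsed_form: str) -> str:
    """Gloss a word form whose morph boundaries, if any, are marked by '-'."""
    parts = []
    for i, morph in enumerate(parsed_form.split("-")):
        # Assumption: non-initial morphs are suffixes, entered as '-x'.
        key = morph if i == 0 else "-" + morph
        parts.append(GLOSSES[key].strip("-"))
    return "-".join(parts)


print(gloss("kim-s"))  # die-CAUS
print(gloss("kims"))   # kill
```

The lexicon holds exactly one unanalyzed gloss per entry; the complex gloss arises only at lookup time, from the parse chosen in the text.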

Semantic classes

Each lexical item – at least those with a lexical meaning – belongs to one or more semantic classes. For instance, leg is a body part, spider is an animal, laugh is an expression of emotion. This classificatory information is implicit or even explicit in a good definition (field #22). However, it is highly useful to specify it in a separate field:

While working on the lexical database, one may select all the entries of a given semantic class and compare them. This is the most efficient way to check the lexical database for consistency.
One may output partial or comprehensive onomasiological dictionaries from the database.
Depending on the language, some of these classes may be grammatically relevant and, thus, reappear in the grammar.

The semantic classes form a range set. A proposal for a practical set of semantic classes applicable to many languages is on a separate page.

Even monosemous lemmas may belong to more than one semantic class. For instance, apple is both a plant part and a food (and so are all the fruits).

Database solution
relational	a table of semantic categories and a cross-table connecting lemmas with such categories
free field structure	optionally more than one instance of this field per record
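
A possible shape of the relational solution, including the consistency check of selecting all entries of one class, is sketched below. Table and column names are assumptions; the lemmas and classes are the examples used in this section.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE lemma (id INTEGER PRIMARY KEY, form TEXT NOT NULL);
    CREATE TABLE sem_class (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE lemma_sem_class (
        lemma_id INTEGER REFERENCES lemma(id),
        class_id INTEGER REFERENCES sem_class(id),
        PRIMARY KEY (lemma_id, class_id)
    );
    INSERT INTO lemma VALUES (1, 'apple'), (2, 'leg'), (3, 'laugh');
    INSERT INTO sem_class VALUES (1, 'plant part'), (2, 'food'),
                                 (3, 'body part'), (4, 'expression of emotion');
    INSERT INTO lemma_sem_class VALUES (1, 1), (1, 2), (2, 3), (3, 4);
""")


def members(class_name):
    """All lemmas assigned to one semantic class."""
    rows = con.execute(
        """SELECT l.form FROM lemma l
           JOIN lemma_sem_class x ON x.lemma_id = l.id
           JOIN sem_class s ON s.id = x.class_id
           WHERE s.name = ? ORDER BY l.form""",
        (class_name,),
    ).fetchall()
    return [form for (form,) in rows]


def classes_of(form):
    """All semantic classes of one lemma."""
    rows = con.execute(
        """SELECT s.name FROM sem_class s
           JOIN lemma_sem_class x ON x.class_id = s.id
           JOIN lemma l ON l.id = x.lemma_id
           WHERE l.form = ? ORDER BY s.name""",
        (form,),
    ).fetchall()
    return [name for (name,) in rows]


print(members("food"))      # ['apple']
print(classes_of("apple"))  # ['food', 'plant part']
```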

Semantic relations

This field contains paradigmatic lexical relations to other lemmas which have the current lemma in the corresponding field. These relations are, therefore, mutual. The technical implementation is as above for the derivational relations (field #17).

Relevant relations include:

synonymy,
hyponymy/hyperonymy,
cohyponymy: antonymy, complementarity (contradictory contrast), converse relation, minimal contrast,
part-whole relation.

See the detailed treatment of semantic relations in lexicography.
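
The mutuality of these relations can be guaranteed by always recording a link together with its converse. The sketch below keeps everything in memory; the relation labels, the names `CONVERSE`, `relations` and `relate`, and the lemmas hand, finger, fruit and apple are chosen for illustration only.

```python
from collections import defaultdict

# Each relation label is paired with the label of its converse.
CONVERSE = {
    "synonym": "synonym",
    "antonym": "antonym",
    "hyponym": "hyperonym",
    "hyperonym": "hyponym",
    "part": "whole",
    "whole": "part",
}

relations = defaultdict(set)   # lemma -> {(relation, other lemma)}


def relate(lemma, relation, other):
    """Record that `other` is a `relation` of `lemma`, plus the converse link."""
    relations[lemma].add((relation, other))
    relations[other].add((CONVERSE[relation], lemma))


relate("hand", "part", "finger")     # finger is a part of hand
relate("fruit", "hyponym", "apple")  # apple is a hyponym of fruit

print(sorted(relations["finger"]))  # [('whole', 'hand')]
print(sorted(relations["apple"]))   # [('hyperonym', 'fruit')]
```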

Encyclopedic information

The content of this field goes beyond linguistic semantics, giving information on real-world, especially culture-specific properties of concepts designated. This field may refer to the pertinent section of the 'Situation of the language', esp. the ethnographic situation, for background information.

Picture

The field contains a link to an illustrative image file which may be shown automatically or upon mouseclick. Naturally, this will be relevant in connection with the previous field.

F. Genetic-historical information

The following set of fields contains information on historical and genetic relations of the lemma.

Origin

This field is relevant for loans. Its content is the name of a language. This should be taken from a range set.

Etymology

This field contains information on the etymology of the lemma, like an etymological dictionary. Naturally, this is subservient to information on word formation, which is contained in field #15.

Cognates

This field contains formally or semantically related words from genetically related languages. Apart from their intrinsic interest, they are often methodologically useful since they may help identify the basic meaning of a lexeme.

Examples

An example in a dictionary entry illustrates a specific sense or construction. This local relevance of the example contributes to the reasons for apportioning a polysemous or syntactically heterogeneous item to several records.

An example in a dictionary entry has a double functionality:

It serves as documentary evidence for the descriptive statements (the semantic definition, the grammatical categorization, the stylistic marking etc.).
It helps the dictionary user in employing/understanding the lemma in his texts.

For both purposes, it is convenient to select typical examples. A typical example:

illustrates exactly the local sense or construction in an entry,
represents a frequent collocation,
is as simple as possible, i.e. contains no unnecessary grammatical/semantic/stylistic complications.

In their function #1, examples must be drawn from the corpus. However, the corpus does not always contain suitable (typical) examples. Examples that only have function #2 may be concocted by the lexicographer. The safest way of doing this is by simplifying a corpus example.

The structure of a dictionary example is as follows:

some complex expression containing some form of the lemma, represented by a tilde (~)
translation into a background language
(source of the example in the corpus)

In a monolingual dictionary, the translation is missing.

Each lexical entry (record) has a set of examples. The examples are not given literally, but represented by record IDs of the text corpus.

Database solution
relational	cross-table linking lemma IDs with IDs of records of the corpus
free field structure	set of instances of the field ‘examples’ per record, each containing a hyperlink to a record of the corpus
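
The relational solution and the three-part structure of an example can be combined as in the following sketch. Everything specific is hypothetical: the table and column names, the lemma ID 7, the German sentence and the source label T1.12. Storing the occurring word form in the cross-table is also just one way of knowing what to replace by the tilde.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE corpus (
        id INTEGER PRIMARY KEY, text TEXT, translation TEXT, source TEXT
    );
    CREATE TABLE lemma_example (lemma_id INTEGER, corpus_id INTEGER, form TEXT);
    INSERT INTO corpus VALUES
        (1, 'Er will Arzt werden.', 'He wants to become a doctor.', 'T1.12');
    INSERT INTO lemma_example VALUES (7, 1, 'werden');
""")


def examples(lemma_id, bilingual=True):
    """Render the examples of one entry: expression, translation, source."""
    rows = con.execute(
        """SELECT c.text, x.form, c.translation, c.source
           FROM lemma_example x
           JOIN corpus c ON c.id = x.corpus_id
           WHERE x.lemma_id = ? ORDER BY c.id""",
        (lemma_id,),
    )
    out = []
    for text, form, translation, source in rows:
        line = text.replace(form, "~", 1)   # the lemma form becomes a tilde
        if bilingual:
            line += f" '{translation}'"     # omitted in a monolingual output
        out.append(f"{line} ({source})")
    return out


print(examples(7))
# ["Er will Arzt ~. 'He wants to become a doctor.' (T1.12)"]
```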

References

Sinclair (ed.) 1987, ch. 7

G. Methodology

The set of fields following here contains information relevant for the researcher in working on the database. Most of it is not destined to be published.

Bibliographical references

Information on lexical properties, including whole lexical items, may come from published sources. In principle, it may relate to any of the fields of the entry in particular. For that reason, it might seem appropriate to accompany each field of a record with its own bibliographical reference field. However, that would be overdoing it. It suffices to have one field for such references, which contains free text specifying which information comes from which source. The source itself is indicated by the conventional bibliographical short form (like ‘Hale 1985’), implying reference to a bibliographical database that resolves such short forms.

Such lexical information can generally only be published after counterchecking it with one's text corpus and/or informants. After that, the bibliographical reference no longer needs to appear in the published dictionary entry and may pass to the general bibliography.

Comment

This field contains any additional information, esp. of a methodological, stylistic, sociolinguistic nature, including the status of the lemma and ungrammatical examples. In contrast to the following field, its content could, in principle, be published (although it seldom will be).

Problems

This field contains questions to be investigated and problems to be solved in future lexicographic work, especially fieldwork. This field is directly related to the previous one: a problem is formulated in the present field. Once it is solved, its solution is noted in the field ‘comment’ (so that it may not be forgotten), and the problem is deleted. Thus, the content of this field is destined exclusively for the researcher and never published.
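
The interplay of the two fields amounts to a small workflow, sketched here with an invented class `WorkNotes`; the wording of the sample problem and its solution is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WorkNotes:
    """The fields 'comment' and 'problems' of one record (names illustrative)."""
    comment: str = ""
    problems: List[str] = field(default_factory=list)

    def resolve(self, problem: str, solution: str) -> None:
        """Delete a solved problem and keep its solution in the comment."""
        self.problems.remove(problem)   # ValueError if it was never recorded
        note = f"{problem} -> {solution}"
        self.comment = f"{self.comment}; {note}" if self.comment else note


notes = WorkNotes(problems=["plural form?"])
notes.resolve("plural form?", "regular plural confirmed by informant")
print(notes.problems)  # []
print(notes.comment)   # plural form? -> regular plural confirmed by informant
```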

Date

This is the date of last modification, which the DBMS will update automatically.

Minimum and expanded microstructure

Only a dictionary that boils down to a word list has no microstructure (see the retrograde dictionary as an example). Traditionally, the minimum microstructure for a general dictionary is:

lemma – definition – examples.

On the other hand, additional kinds of information not mentioned above are easily imaginable: paronomasias, concept history etc.

If there is a team of lexicographers, it may be necessary to add a field ‘editor’ to the methodology section, which holds the initials of the researchers who touched the entry, in chronological order.

Mono- vs. bilingual dictionary

The microstructures of a monolingual dictionary and of the L₂–L₁ volume of a bilingual general dictionary differ only in a few fields:

In field 22, the monolingual dictionary provides a definition in the language of the dictionary, while the bilingual dictionary lists the equivalents of the lemma in L₁.
In field 31, the bilingual dictionary couples each example with its L₁ translation, which the monolingual dictionary does not.

¹ Toolbox comes with an automatic interlinear glossing feature which works rather economically for agglutinative morphology.