Provenience of linguistic data

Data have the methodological function of being taken for granted in a piece of research. In order to fulfill this function, they must originate outside the researcher; and although they may owe their very existence to the researcher (because he somehow elicited them), their particular properties must be independent of him. Examples constructed by the researcher are not data because they fail this condition.

Generation of linguistic data

1. Linguistic data may pre-exist the research. In this case, the researcher just records, copies or notes them down.
2. They may be generated as part of the research.
   a. A speaker may be provided with a stimulus provoking him to generate utterances of some desired kind.
      - The stimulus may be a linguistic stimulus, e.g. a sentence of the lingua franca to be translated or a questionnaire to be filled in. This is the most common form of elicitation of linguistic data.
      - It may be a non-linguistic stimulus, e.g. a task to be performed or a request to produce a text on some topic. This is a common form of staged communication.
   b. A speaker may be confronted with linguistic data produced by somebody else, including the researcher, and may react to them by an informant judgement.

Relatively uncontrolled generation of data, as in #1 and #2a above, and relatively controlled generation of data, as in #2b above, complement each other in linguistic research. The former guarantees the availability of spontaneous, natural data that illustrate diverse linguistic varieties. The latter is needed to complete a systematic representation of the linguistic system.

Identification of the source of data

In the past, i.e. roughly up to the middle of the 20th century, standards for the identification of the source of linguistic data were rather sloppy. The exception is the philologies, where standards used to be (and still are) relatively strict. There, a sample drawn from the literature has usually been provided with an indication of the author, work and section quoted. Only in publications intended for the inner circle could this be omitted, because sapienti sat.

In descriptive linguistics, the source of the data used to be left in the dark. Some grammars, including all of the colonial grammars (e.g. San Buenaventura 1684), make no mention of their sources at all. More modern grammars, e.g. Dixon 1972, at least list the informants in the preface, without, however, identifying the sources of the examples or even of the texts. Only since about the last quarter of the 20th century have standards for identifying the sources of linguistic data become stricter (Berez-Kroeker et al. 2018).

There are many good reasons why the source of every text and every example sentence in a linguistic description should be identified. This goes without saying for data quoted from published sources. It is also true for unpublished data:

The speaker who provided them deserves acknowledgement. There are very few cases where he prefers to remain anonymous; needless to say, this wish is to be respected, too. The question of acknowledgement must be seen against the background of scholarly ethics in general. The contribution of a native speaker to a linguistic work may be such that he should not only be mentioned as the source of its data, but ought even to figure as its coauthor. A relatively early example of this is Bricker et al. 1998.
The researcher may need, some time later, to double-check a piece of data he finds in his notes, which he can only do if he knows its source.
An aspect of a piece of data may be idiolectal or characteristic of a certain dialect or sociolect. The researcher typically cannot know this at the moment he records it, but it needs to be known if an entire language system is to be described.
A user of the data may, for whatever reason, need to know its source.

All of this amounts to the following requirements:

In linguistic fieldwork, make a database of the speakers. For each speaker, enter the following data:
- name
- sex
- year and place of birth
- current domicile, since when
- languages spoken
- ID used in references to him.
If there is a chance of the data being used in a study on variation, also note the provenience of the speaker's primary caregiver (the first linguistic role model, typically the mother).
At the moment of recording a set of data, note down the following:
- If the data are collected by more than one researcher, the ID of the researcher
- ID of the informant
- date and place of the recording.
Assign each piece or set of data an identifier. This differs for kinds of data:
- For an entire text:
  - Ask the informant for a title.
  - Derive an identifier of the text from the speaker name and the title, to be used in references to examples drawn from the text.
  - Number the sentences of the text consecutively, starting with the title. In references to excerpts from the text, the sentence number is appended to the ID of the text.
- For a set of utterances elicited from the same speaker:
  - Number the utterances consecutively.
  - Compose the reference to an utterance from the informant's ID and the number of the utterance.
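
The requirements above can be sketched as a small data model. This is only an illustration under stated assumptions: the field names, the ID scheme (speaker initials plus a lower-case title slug), the dot as separator and the choice of 1 as the number of the title are all hypothetical and not prescribed by the text.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Speaker:
    """One entry of the speaker database (field names are illustrative)."""
    speaker_id: str                  # ID used in references to the speaker
    name: str
    sex: str
    birth_year: int
    birth_place: str
    domicile: str
    domicile_since: int              # year since which the speaker has lived there
    languages: tuple[str, ...]
    caregiver_provenience: Optional[str] = None  # only for variation studies


@dataclass(frozen=True)
class Recording:
    """Metadata noted down at the moment of recording a set of data."""
    speaker_id: str
    recorded_on: date
    place: str
    researcher_id: Optional[str] = None  # only if several researchers collect data


def text_id(speaker: Speaker, title: str) -> str:
    """Derive a text ID from the speaker's name and the title (assumed scheme)."""
    initials = "".join(word[0] for word in speaker.name.split()).upper()
    slug = re.sub(r"\W+", "_", title).strip("_").lower()
    return f"{initials}_{slug}"


def number_sentences(title: str, sentences: list[str]) -> dict[int, str]:
    """Number the sentences consecutively; the title gets the first number."""
    return dict(enumerate([title, *sentences], start=1))


def sentence_ref(text_identifier: str, number: int) -> str:
    """Append the sentence number to the ID of the text."""
    return f"{text_identifier}.{number}"


def utterance_ref(speaker: Speaker, number: int) -> str:
    """Compose a reference from the informant's ID and the utterance number."""
    return f"{speaker.speaker_id}.{number}"


# Hypothetical speaker, recording and text, for illustration only.
speaker = Speaker("ES", "Example Speaker", "f", 1960, "Village A",
                  "Town B", 1985, ("Language X", "Language Y"))
session = Recording(speaker.speaker_id, date(2020, 1, 15), "Town B")
tid = text_id(speaker, "The Hunt")
numbered = number_sentences("The Hunt", ["First sentence.", "Second sentence."])
print(sentence_ref(tid, 3))        # reference to the third unit of the text
print(utterance_ref(speaker, 12))  # reference to the twelfth elicited utterance
```

Whatever scheme is chosen, it should be fixed before fieldwork begins and applied uniformly, so that every example can later be traced back to one speaker and one recording.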

For data used in a linguistic work, the following should be done in the published text:

Make two lists of references, one for data (sometimes called ‘primary sources’), another one for scientific publications (sometimes called ‘secondary sources’).
Each primary source entry comprises a reasonable subset of information from the informant database and from the collection of data. Each entry is introduced by the ID of the source as used in the examples; and the list is ordered by these IDs.
The list of secondary sources contains the scientific works used, including specifically those publications from which data are quoted in the present publication.
Every example in the running text is accompanied by an identifier which refers to the list of references. For examples drawn from primary sources, this is the ID of the example composed as explained above. For examples drawn from secondary sources, this is the short reference to the publication (in the standard format of author plus year) followed by the page number.
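
The conventions for the published text can be sketched along the same lines. Again this is a minimal illustration, not a fixed format: the class `PrimarySource`, its fields, the layout of a list entry and the separators in the two kinds of reference are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrimarySource:
    """One entry of the list of primary sources (fields are illustrative)."""
    source_id: str   # ID of the source as used in the examples
    speaker: str     # subset of the information in the speaker database
    recorded: str    # date and place of the recording


def primary_source_list(sources: list[PrimarySource]) -> list[str]:
    """Format the entries, each introduced by its ID, ordered by these IDs."""
    ordered = sorted(sources, key=lambda source: source.source_id)
    return [f"{s.source_id}: {s.speaker}; {s.recorded}" for s in ordered]


def primary_ref(source_id: str, number: int) -> str:
    """Identifier of an example drawn from a primary source."""
    return f"{source_id}.{number}"


def secondary_ref(author: str, year: int, page: int) -> str:
    """Short reference (author plus year) followed by the page number."""
    return f"{author} {year}: {page}"


# Hypothetical sources, for illustration only.
sources = [
    PrimarySource("TS_story", "Third Speaker", "2021, Village A"),
    PrimarySource("ES_the_hunt", "Example Speaker", "2020, Town B"),
]
for entry in primary_source_list(sources):
    print(entry)
print(primary_ref("ES_the_hunt", 3))
print(secondary_ref("Author", 2000, 17))
```

Keeping the ID at the head of each entry means a reader can go from any example in the running text directly to its entry in the list.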