The following are just some preliminary, informal hints.
Rational choices in the electronic representation of data presuppose their analysis in terms of criteria such as the following:
- nature of the data
- structure of the data
- demands on access and retrieval of the data.
Kinds of data
One major distinction between kinds of data is that between digital and analog data. Digital data encode everything they represent as numbers (inside the computer, as binary digits); analog data bear an iconic resemblance to the phenomena they represent. Digital computers can only process digital data (a bit is either set or not set; it cannot be 37% set). Analog data therefore have to be digitized before they enter the digital computer (and they may be converted back to analog form when they leave it in order to be perceived).
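The digitization step can be illustrated by a minimal Python sketch. It assumes that the analog signal is given as a function of time with values between -1 and 1; the function name `digitize`, the sample rate and the 256 levels are illustrative assumptions, not part of any standard.

```python
import math

def digitize(signal, sample_rate, duration, levels=256):
    """Sample a continuous signal and map each sample to an integer code.

    Assumes signal(t) returns a value in the range [-1, 1].
    """
    codes = []
    for i in range(int(sample_rate * duration)):
        value = signal(i / sample_rate)                    # analog value at time t
        codes.append(round((value + 1) / 2 * (levels - 1)))  # integer in 0..levels-1
    return codes

# A one-second sine wave, sampled 8 times and stored as 8-bit codes.
codes = digitize(lambda t: math.sin(2 * math.pi * t), sample_rate=8, duration=1.0)
print(codes)
```

Whatever the signal was, the result is a list of integers; everything between two samples and between two levels is lost.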
When talking of computerized (digital) data, we will nevertheless distinguish between analog and symbolic data, according to their function for the user:
- Analog (computer) data are digitized analog (raw) data. They represent phenomena to be perceived (visually or auditorily). Examples include pictures, video and audio recordings.
- Symbolic data are treated as signs by the user. Examples include textual data and figures; both are called alphanumeric data.
The technical counterparts of these kinds of data are data types. These are specific arrangements and interpretations of sequences of bits and bytes. Here are some examples of data types in use at the beginning of the 3rd millennium:
- analog data types: BMP, MPEG, RAW, WAV, MP3 ...
- symbolic data types: integer, long integer, date, one-byte character, two-byte character, (character) string ...
The above analog data types are, at the same time, file types, which reflects the fact that the internal structure of such files is generally opaque to the user.
Symbolic data can always be represented as text; analog data cannot. That means that the user can, within certain limits, choose the data type for his symbolic data. More on this below.
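That a data type is an interpretation of a byte sequence, rather than a property of the bytes themselves, can be seen in the following sketch. It uses Python's standard `struct` module; the four byte values are arbitrary and chosen only for illustration.

```python
import struct

raw = bytes([0x41, 0x42, 0x43, 0x44])  # four bytes, nothing more

as_text = raw.decode("ascii")              # read as four one-byte characters
as_uint32 = struct.unpack("<I", raw)[0]    # read as one little-endian 32-bit integer
as_two_uint16 = struct.unpack("<2H", raw)  # read as two little-endian 16-bit integers

print(as_text)        # ABCD
print(as_uint32)      # 1145258561
print(as_two_uint16)  # (16961, 17475)
```

The same bytes yield a string, one number or two numbers, depending solely on the type under which they are read.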
Structure of data
Since the computer only processes bits and bytes, any structure of the data is defined by the user (esp., the programmer). Aspects of the structure concern data types and their configuration. Such a configuration may consist of data of the same or of different types. It may become complex by nesting at several levels. Here are some examples:
- A date, i.e. a certain moment in the history of the world, has the structure of a long real number representing the number of milliseconds since some fixed date (e.g. the big bang). It is transformed into a combination of days, months and years only when displayed to humans.
- A table of figures, e.g. a column or a two-dimensional table of integer figures (which one may wish to total), has the structure of an array of integers.
- Running text as, e.g., the text in this paragraph, has the structure of a string of alphanumeric symbols.
- A bilingual glossary has the structure of a two-dimensional array (of two columns in n rows) of text strings.
- The alphabetic word list of a frequency dictionary has the structure of a two-dimensional array of n rows each of which pairs a text string with a long integer.
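The structures just listed might be sketched in Python as follows. This is an illustration under assumptions: the Unix epoch (1 January 1970, UTC) stands in for the "fixed date", and the frequency counts are invented.

```python
from datetime import datetime, timezone

# A date as a real number: milliseconds since a fixed reference point.
moment = datetime(2006, 4, 23, tzinfo=timezone.utc)
millis = moment.timestamp() * 1000.0

# Only for display is the number turned back into day, month and year.
shown = datetime.fromtimestamp(millis / 1000.0, tz=timezone.utc).strftime("%d/%m/%Y")

# A column of integer figures is an array of integers and can be totalled.
column = [12, 7, 30]
total = sum(column)

# A bilingual glossary: n rows of two text strings each.
glossary = [("date", "Datum"), ("digit", "Ziffer"), ("file", "Datei")]

# A frequency list: n rows pairing a text string with an integer (invented counts).
frequencies = [("and", 2750), ("of", 3010), ("the", 6200)]

print(millis, shown, total)
```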
Use of data
Alphanumeric data can, in principle, be treated as text (in a string). For instance, a date, instead of being stored as a long real number, may be stored as a text string of the form ‘23/04/2006’. Choice of the data type for storing a piece of data depends on the purpose it is meant to serve and on the kinds of operations that one wishes to execute on the data:
- If a date is only to be displayed as such (e.g., in a letter head), then it suffices to treat it as a text string. If dates are to be arithmetically compared, e.g. in order to find data that are more recent than a certain date, they will be stored as long reals.
- The zip code of an address (in an address database) may be treated as a long integer. However, apart from displaying it, the only operation commonly performed on zip codes is their sorting; and that works not only for integers, but also for alphanumeric strings. Consequently, although a zip code consists exclusively of digits, it is commonly treated as a text string.
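The two cases can be tried out in a short sketch. Python's `date` type stands in here for the numeric representation of dates; `parse_date` is a hypothetical helper, and the zip codes are arbitrary five-digit examples.

```python
from datetime import date

def parse_date(text):
    """Convert a 'DD/MM/YYYY' string into a date object (hypothetical helper)."""
    day, month, year = text.split("/")
    return date(int(year), int(month), int(day))

dates_as_text = ["23/04/2006", "01/12/2005", "15/01/2007"]

# Sorted as text, the strings are ordered by their first characters (the day).
text_order = sorted(dates_as_text)

# Sorted as dates, they come out in chronological order.
date_order = sorted(dates_as_text, key=parse_date)

# Zip codes of equal length sort correctly as strings.
zip_codes = ["53111", "01067", "10115"]
zip_order = sorted(zip_codes)

print(text_order)  # ['01/12/2005', '15/01/2007', '23/04/2006']
print(date_order)  # ['01/12/2005', '23/04/2006', '15/01/2007']
print(zip_order)   # ['01067', '10115', '53111']
```

Note that a zip code stored as an integer would lose a leading zero (`int("01067")` is 1067), which is one more practical reason for the string representation.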
Symbolic data are represented formally if the file has a technical structure that corresponds to aspects of the logical structure of the data. If the file has no such technical structure, it is an informal representation of data.
If one assigns a piece of symbolic data to its specific data type (e.g. character, long integer, real, date, boolean etc.), one has chosen a formal representation. The software then guarantees that the data will be processed in a consistent way. For instance, it guarantees that a date will not contain a month figure above 12. If one assigns a piece of symbolic data the data type ‘text string’, that is so far an informal representation. Such a piece of data can become inconsistent and may be operated upon in inappropriate ways by human interference. In scientific contexts, it is therefore advisable to bestow a formal representation on one's data.
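The guarantee mentioned for dates can be observed in Python's standard `date` type, used here only as one example of a formal representation; `is_valid_date` is a hypothetical helper.

```python
from datetime import date

def is_valid_date(year, month, day):
    """Return True if the date type accepts the combination (hypothetical helper)."""
    try:
        date(year, month, day)
        return True
    except ValueError:
        return False

# Informal representation: a text string accepts anything.
informal = "23/13/2006"

# Formal representation: the data type rejects a month above 12.
print(is_valid_date(2006, 4, 23))   # True
print(is_valid_date(2006, 13, 23))  # False
```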
Storing data in a word-processor file (e.g. an Open Office Text file or an MS Word file) means opting for an informal representation. The advantage is easy access to the data for unsophisticated users (those who use the computer for games, internet browsing and, at most, as a typewriter). However, even this advantage has narrow limits. For instance, one may store a bilingual glossary (see above for its logical structure) in a text file by using a tabulator between the lemma and its gloss, like this:
digit Ziffer
real Gleitkommazahl
integer ganze Zahl
string Zeichenkette
file Datei
date Datum
... ...
One way of accessing such a glossary is to look up an entry by its English lemma. If, however, the search string also occurs on the German side (like ‘date’ in the example, which a case-insensitive search also finds in ‘Datei’), then the word processor cannot tell the two columns apart and highlights the German word. Still less can the word processor sort such a list alphabetically or transform it into the converse (German-English) glossary. This method is therefore inadvisable (no matter how many glossaries in the third millennium AD still have this form).
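The search problem can be reproduced in a few lines. The sketch assumes that the word processor performs a case-insensitive search from the top of the file; `first_hit_line` is a hypothetical helper that imitates this behaviour.

```python
glossary_text = (
    "digit\tZiffer\n"
    "real\tGleitkommazahl\n"
    "integer\tganze Zahl\n"
    "string\tZeichenkette\n"
    "file\tDatei\n"
    "date\tDatum\n"
)

def first_hit_line(text, query):
    """Return the line holding the first case-insensitive match, or None."""
    position = text.lower().find(query.lower())
    if position == -1:
        return None
    start = text.rfind("\n", 0, position) + 1  # beginning of the line with the hit
    end = text.find("\n", position)            # end of that line
    if end == -1:
        end = len(text)
    return text[start:end]

hit = first_hit_line(glossary_text, "date")
print(repr(hit))  # 'file\tDatei'
```

The search lands in the entry for ‘file’, because nothing in the file tells the program that only the first column holds lemmas.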
A minimum of formal structure would be to arrange the glossary in a two-column table, like this:
date | Datum |
digit | Ziffer |
file | Datei |
integer | ganze Zahl |
real | Gleitkommazahl |
string | Zeichenkette |
... | ... |
With this structure, even a (contemporary) word processor can perform the operations mentioned above.
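Once the two columns are kept apart, the three operations become straightforward. The following sketch holds the table as a list of pairs; the helper names `lookup` and `converse` are assumptions made for illustration.

```python
rows = [
    ("date", "Datum"),
    ("digit", "Ziffer"),
    ("file", "Datei"),
    ("integer", "ganze Zahl"),
    ("real", "Gleitkommazahl"),
    ("string", "Zeichenkette"),
]

def lookup(rows, lemma):
    """Find glosses by searching the lemma column only."""
    return [gloss for entry, gloss in rows if entry == lemma]

def converse(rows):
    """Swap the columns and sort by the new first column, ignoring case."""
    swapped = [(gloss, entry) for entry, gloss in rows]
    return sorted(swapped, key=lambda row: row[0].casefold())

print(lookup(rows, "date"))  # ['Datum']
for german, english in converse(rows):
    print(german, "|", english)
```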
Naturally, the amount of formal structuring that one bestows on one's data also depends on their quantity. If one's data are limited to one Shakespearean sonnet, then surely the most efficient way of guaranteeing flexible data retrieval is to learn it by heart. However, if the data exceed a certain size, it becomes worthwhile to store them in a formal structure. For instance, if the above glossary grows into a dictionary of several thousand entries, its appropriate place is in a database.
Formal representation of textual data
There used to be a sharp distinction between a text file and a database (file); and it reflected, at the technical level, the logical distinction between informal and formal representation. This, however, has changed drastically since markup languages emerged. These are formal languages which mark the logical structure of text files. Their marks (“tags”) are interspersed with the text data themselves. The effect is that the file is, technically speaking, a text file (it may, in fact, even be a pure ASCII file), but at the same time it may have a formal structure similar to a database.1 Markup languages (like SGML, XML etc.) are not treated here (there is sufficient information on the internet). Since they are some decades younger than databases and (as of 2011) still under development, such text files still demand more programming effort from their user and are slower in processing than database files; but that may change soon.
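As a brief illustration only, here is how the glossary used earlier might look as a marked-up text file and how its structure can be read back. The element names `glossary`, `entry`, `lemma` and `gloss` are invented for this sketch; the parsing uses Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# A plain text string whose tags carry the logical structure (invented element names).
xml_text = """<glossary>
  <entry><lemma>date</lemma><gloss>Datum</gloss></entry>
  <entry><lemma>file</lemma><gloss>Datei</gloss></entry>
</glossary>"""

root = ET.fromstring(xml_text)

# The hierarchy glossary > entry > lemma/gloss yields the same pairs as a table.
pairs = [(entry.findtext("lemma"), entry.findtext("gloss"))
         for entry in root.findall("entry")]

print(pairs)  # [('date', 'Datum'), ('file', 'Datei')]
```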
The distinction between a database file and a text file may still be helpful in distinguishing two principal kinds of textual data:
- chunks of linguistic data
- running text
‘Chunks of linguistic data’ are elements such as morphs, words, syllables, proverbs etc., taken out of any context. They are easily stored in a database.
Running text, as in a sentence, paragraph, chapter, book or collection, differs from chunks of linguistic data by one property that is essential from the point of view of computer engineering: it does not have a fixed length and does not consist of a fixed number of elements, but is just sequential in structure. It is just for this reason that the file type ‘text file’ was invented. There are ways of fitting a running text into a database; but that remains a makeshift.
Nevertheless (taking up the distinction between standard database and free-field-structure database made in the section on databases), storing a running text in a free-field-structure database is, for many linguistic purposes, an appropriate solution. Here is one way of doing it. It is, again, oriented towards use of the program Toolbox (SIL). Implementing it as an XML file would be technically equivalent.2
- Store the text in an ASCII file.
- Split it up into linguistic units of at most sentence size. Very long sentences may be split up into clauses, very long clauses may be split up into phrases. The aim is to produce units of a kind that will often be accessed for consultation, retrieval and export, e.g. in the function of examples.
- Define the field structure of a free-field-structure database in such a way that there is minimally a field for the record ID and a field for a chunk of text of the kind just produced.
- Import the file into the database, as follows: Each linguistic unit constitutes a record, occupying the field just created.
- Number the records consecutively. This should be done in such a way that each record has an unambiguous ID in the entire corpus. The ID may, for instance, consist of an alphabetic part referring to the text title and a numerical part identifying the consecutive number of the record in the running text. This ID occupies the first field of the record.
- Classify all further information to be added to the text unit. For instance, there may be:
  - a representation in another script
  - tags of various kinds, like an interlinear morpheme gloss, syntactic categories etc.
  - a free translation.
- See to it that the DBMS keeps the records ordered according to their ID.
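The steps above can be sketched as follows. This is not the import routine of Toolbox, but an independent illustration: the sentence splitting is a crude approximation, the helper names are hypothetical, and the backslash markers `\ref`, `\tx` and `\ft` are assumed field names that would have to match the field structure actually defined in the database.

```python
import re

def make_records(text, title_code):
    """Split a running text into sentence-sized units and give each a unique ID."""
    # Crude assumption: a unit ends with '.', '!' or '?' followed by whitespace.
    units = [u for u in re.split(r"(?<=[.!?])\s+", text.strip()) if u]
    return [
        {"id": f"{title_code}_{number:03d}", "text": unit, "translation": ""}
        for number, unit in enumerate(units, start=1)
    ]

def render(record):
    """Write one record with one field per line (assumed marker names)."""
    return "\n".join([
        f"\\ref {record['id']}",
        f"\\tx {record['text']}",
        f"\\ft {record['translation']}",
    ])

records = make_records("Veni. Vidi. Vici.", "CAES")

# Zero-padded numbers keep the records in text order when sorted by ID.
ordered = sorted(records, key=lambda record: record["id"])

print("\n\n".join(render(record) for record in ordered))
```

The alphabetic part of the ID (`CAES`) identifies the text, the numerical part the position of the unit in it; further fields would be added to the dictionary in the same way as `translation`.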
As just said, this is a makeshift. On the one hand, the horizontal structure of the text is dismembered and represented only in the consecutive numbering of the records. If the record IDs got lost, the file would reduce to a collection of chunks of linguistic data. On the other hand, much of the information assembled in a record of this kind has a vertical structure in the sense that, e.g., the interlinear gloss of a morph refers to a certain morph contained in another field of the same record. This kind of vertical alignment in what is essentially a text file is not easily processed algorithmically; Toolbox provides some rudimentary help.
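The vertical structure can be made explicit by a simple positional alignment. The sketch assumes that morphs and glosses are separated by whitespace and correspond one to one; the function name `align` and the sample lines are illustrative, and this is not a description of how Toolbox itself proceeds.

```python
def align(morph_line, gloss_line):
    """Pair each morph with the gloss in the same position of the other field."""
    morphs = morph_line.split()
    glosses = gloss_line.split()
    if len(morphs) != len(glosses):
        # The alignment is only implicit in the text, so it can silently break.
        raise ValueError("morph line and gloss line differ in length")
    return list(zip(morphs, glosses))

print(align("Datei -en", "file -PL"))  # [('Datei', 'file'), ('-en', '-PL')]
```

The check on the number of items shows where the difficulty lies: nothing in the file itself enforces that the two fields stay in step.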
1 The essential difference is that the formal structure of a markup language file is a hierarchy (a meronymy), while the formal structure of a database generally is not.
2 One of the export formats of Toolbox 1.5 is, in fact, an XML file.