Basics
A computer file is a string of bytes. Given that a byte is composed of eight bits, and a bit is a binary digit (which is either set or not set), it follows that there are altogether 2^8 = 256 different bytes, coded as (2^0 - 1 =) 0 up to (2^8 - 1 =) 255. In the beginning of the computer age, only seven of the eight bits were used to code information, which makes 2^7 = 128 different bytes. The information coded may be anything, e.g. numbers, pixels or text characters (letters). These are alternative interpretations of strings of bytes that depend on the software that handles the computer file. Bytes may also be combined into double bytes or quadruple bytes, for instance to code higher numbers or exotic characters. This, again, is an interpretation of a sequence of bytes. As double bytes were increasingly used for different tasks, even the hardware was designed to handle double bytes instead of single ones.
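The point that one and the same string of bytes allows several interpretations can be illustrated with a small Python sketch. The four byte values and the byte order (lowest byte first) are assumptions chosen for the illustration; they are not prescribed by anything above.

```python
import struct

# Four arbitrary example bytes (an assumption made for this illustration).
data = bytes([0x48, 0x69, 0x21, 0x00])

# Interpretation 1: four separate numbers, each between 0 and 255.
as_numbers = list(data)

# Interpretation 2: two double bytes (16-bit numbers).
# "<" assumes that the lowest byte comes first (little-endian order).
as_double_bytes = struct.unpack("<2H", data)

# Interpretation 3: one quadruple byte (32-bit number), same byte order.
as_quadruple_byte = struct.unpack("<I", data)[0]

# Interpretation 4: the first three bytes read as text characters.
as_text = data[:3].decode("ascii")

print(2 ** 8, 2 ** 7)        # 256 128
print(as_numbers)            # [72, 105, 33, 0]
print(as_double_bytes)       # (26952, 33)
print(as_quadruple_byte)     # 2189640
print(as_text)               # Hi!
```

Nothing in the bytes themselves says which of these readings is intended; that is decided by the software that handles them.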
Some computer files contain text, which is meant to be displayed on the screen and to be printed. Consequently, one interpretation of a byte is as coding a character. With seven bits, 2^7 = 128 different characters may be coded. The code is a mapping of bytes onto individual characters. The coding of these 128 characters was standardized at the beginning of the computer age as ASCII (American Standard Code for Information Interchange), the core of the international character set. The ASCII code table is still widely used, e.g. in online bank transfer forms.
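The mapping between bytes and characters can be tried out in Python, whose built-in "ascii" codec is assumed here to follow the ASCII code table. The sample word is arbitrary.

```python
text = "Bank"  # arbitrary sample word

# Characters -> codes: each character is mapped onto one byte.
codes = list(text.encode("ascii"))

# Codes -> characters: the same table read in the other direction.
back = bytes(codes).decode("ascii")

print(codes)  # [66, 97, 110, 107]
print(back)   # Bank

# Every ASCII code fits into seven bits, i.e. it is lower than 2^7 = 128.
fits_in_seven_bits = all(code < 2 ** 7 for code in codes)
print(fits_in_seven_bits)  # True
```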
ASCII (just as any other such code) only defines which byte means which letter. It does not determine the specific shape that the letter takes on the screen or when printed. That is, instead, defined in terms of vectors (in fonts), which are then rendered by configurations of pixels (on the screen and by printers).1
The ASCII code is subdivided into two sections:
- bytes 0 to 31 code actions to be executed by the displaying software,
- bytes 32 to 127 code those (upper and lower case) letters, numbers and punctuation marks used in English (with the exception of 127, which is the action code ‘delete’).
The characters coded in the second ASCII section (32 to 127) can be input with any occidental keyboard, including the use of the Shift key for upper case. (Non-English keyboards need the Alt Gr key for some punctuation marks such as { and [.)
Among the bytes 0 to 31, there are such things as ‘carriage return’ (13), ‘linefeed’ (10) etc. Some of them can also be input with a keyboard. For instance, on Windows systems, hitting the Return key in a text editor inputs the two codes ‘13’ and ‘10’ (Unix-like systems use ‘10’ alone). These codes are executed, rather than displayed, by the (screen and printer) software.
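The division of the ASCII table into action codes and character codes can be sketched as follows. The function name `ascii_section` and the labels "action" and "character" are made up for this illustration, and the sample input assumes a Return key that produces 13 followed by 10.

```python
def ascii_section(code: int) -> str:
    """Name the section of the ASCII table that a code belongs to."""
    if not 0 <= code <= 127:
        raise ValueError("not an ASCII code")
    # 127 ('delete') is, strictly speaking, also an action code.
    if code < 32 or code == 127:
        return "action"
    return "character"


# "Hi" followed by carriage return (13) and linefeed (10).
typed = "Hi\r\n".encode("ascii")
sections = [(code, ascii_section(code)) for code in typed]
print(sections)
# [(72, 'character'), (105, 'character'), (13, 'action'), (10, 'action')]
```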
A file that consists exclusively of ASCII characters is an ASCII file. It displays (and prints) the same way with any software that is at all meant to display text. Moreover, it contains no information that might confound the software.
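Whether a given string of bytes qualifies as an ASCII file in this sense can be checked by testing every byte against the 7-bit limit. This is only a minimal sketch; the helper name `is_ascii` is hypothetical, and the sample data stand in for the contents of a file.

```python
def is_ascii(data: bytes) -> bool:
    """Return True if every byte lies in the ASCII range 0 to 127."""
    return all(byte < 128 for byte in data)


plain = b"plain text\r\n"
accented = bytes([0x73, 0x65, 0xF1, 0x6F, 0x72])  # contains one byte above 127

print(is_ascii(plain))     # True
print(is_ascii(accented))  # False
```

For a real file, the bytes could be obtained with `open(path, "rb").read()`, where `path` is whatever file is to be examined.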
The upper range
There is an extension of this base character set (leaving behind the limitation to 7 bits) which raises the inventory to 2^8 = 256 characters. This character set is known as the ANSI character set, although the American National Standards Institute never in fact standardized it. It comprises ASCII and, additionally, numerous letters provided with diacritics such as ó, ö, ø etc., common in various European languages. Within this extension, there is an international convention for the codes 160 – 255: ISO 8859 defines 15 alternate partial character sets for these codes, which are known as ISO 8859-1 – ISO 8859-16 (number 12 was never issued). Of these, ISO 8859-1 covers all the characters of the German and some other European varieties of the Latin alphabet.
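That the partial character sets of ISO 8859 are alternatives for the same codes can be seen by decoding identical bytes with several of them. The sketch assumes that Python's codecs "iso-8859-1", "iso-8859-5" and "iso-8859-7" reflect the respective parts of the norm; the three byte values are chosen for illustration.

```python
# Three bytes from the range 160 - 255 (243, 246 and 248).
data = bytes([0xF3, 0xF6, 0xF8])

# The same bytes, interpreted by three alternate partial character sets.
decoded = {
    name: data.decode(name)
    for name in ("iso-8859-1", "iso-8859-5", "iso-8859-7")
}

for name, text in decoded.items():
    print(name, text)

# Only ISO 8859-1 yields the Latin letters with diacritics.
print(decoded["iso-8859-1"] == "\u00f3\u00f6\u00f8")  # True
```

The bytes stay the same in all three cases; only the table used to interpret them changes.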
For the codes 128 – 159, there are two main conventions:
- the international norm ISO 8859 reserves this section for further action codes, analogous to the codes 0 – 31 of the lower range;
- the Microsoft norm Windows-1252 fills this section with other characters.
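The difference between the two conventions shows up as soon as a byte from the range 128 – 159 is decoded. The sketch assumes that Python's "iso-8859-1" codec (which treats these codes as control codes) and its "cp1252" codec represent the two conventions; the byte values are examples.

```python
import unicodedata

byte = bytes([0x80])  # code 128

iso = byte.decode("iso-8859-1")
win = byte.decode("cp1252")

# Under ISO 8859-1 the result is an invisible control code
# (Unicode category "Cc"), under Windows-1252 it is the euro sign.
print(unicodedata.category(iso))  # Cc
print(win == "\u20ac")            # True

# Codes 147 and 148 are the curly double quotation marks in Windows-1252.
quotes = bytes([0x93, 0x94]).decode("cp1252")
print(quotes == "\u201c\u201d")   # True
```

Text written under one convention and read under the other therefore loses such characters as the euro sign or typographic quotation marks.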
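Most of the upper-range characters may also be input on occidental keyboards by using the keys for diacritics. If one knows their code, one can also input them by pressing the Alt key while typing the code number on the numeric keypad. For instance, in many text processors under Windows, Alt + 164 writes the ñ. (This number refers to the older DOS code table; in Windows-1252, the code of ñ is 241, which is entered as Alt + 0241.)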
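The code 164 used with the Alt key appears to stem from the older DOS code table rather than from the ANSI character set. This can be checked with a short sketch, assuming that Python's "cp437" and "cp1252" codecs model the DOS table and the Windows table, respectively.

```python
code = 164

# In the DOS code table (code page 437), 164 is the n with tilde.
dos_char = bytes([code]).decode("cp437")
print(dos_char == "\u00f1")  # True

# In Windows-1252, the same code is the generic currency sign.
win_char = bytes([code]).decode("cp1252")
print(win_char == "\u00a4")  # True

# The Windows-1252 code of the n with tilde is 241.
win_code = "\u00f1".encode("cp1252")[0]
print(win_code)  # 241
```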
While the ASCII code table lives on as the initial portion of Unicode, the various norms concerning the upper range of the one-byte character set have been gradually ousted and superseded, since the start of the 21st century, by Unicode.
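That ASCII lives on as the initial portion of Unicode can be verified in a few lines. The sketch uses UTF-8, one common way of storing Unicode in bytes; UTF-8 is not discussed above and is brought in here only as an assumption for the illustration.

```python
# Each of the 128 ASCII codes is stored in UTF-8 as the very same single byte.
ascii_unchanged = all(
    chr(code).encode("utf-8") == bytes([code]) for code in range(128)
)
print(ascii_unchanged)  # True

# An upper-range character keeps its number (243 for o with acute accent) ...
o_acute = "\u00f3"
print(ord(o_acute))  # 243

# ... but is stored as one byte in ISO 8859-1 and as two bytes in UTF-8.
one_byte = o_acute.encode("iso-8859-1")
two_bytes = o_acute.encode("utf-8")
print(len(one_byte), len(two_bytes))  # 1 2
```

A pure ASCII file is thus at the same time a valid Unicode (UTF-8) file, whereas files using the upper range of a one-byte character set are not.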
1 Since a font may assign any fancy shape to the items in a code table, it was possible, even before the advent of Unicode, to output different character sets, including kanji and cuneiform: The characters are always coded by the same 256 bytes. But before they are displayed, a font (earlier: a code table) is chosen which associates each of the codes with a character designed in this particular font, which is then displayed on the screen or the printer.