Linguistic data

In discussing the problem of how one can input Unicode characters, a distinction must be made according to the target file:

1. The target file can hold two-byte (or even four-byte) characters, and the corresponding software treats these as Unicode characters. Some applications, e.g. MS Word, MS Access, Toolbox and others, belong in this category. They display a two-byte character as one character, as they ought to. Likewise, they allow the user to store in Unicode format characters that are typed on the keyboard in the normal way or copied from a Unicode character table.
2. The target file is designed to hold one-byte characters, and the corresponding software, too, processes only these. ASCII editors, e.g. most HTML editors as of 2011, do not “know” that a certain sequence of two bytes is meant to represent one character (as in Unicode). They display such sequences incorrectly and do not support their input. Although it is technically possible to input two-byte characters into an ASCII file, doing so is useless unless one has software that can process such files correctly.
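The incorrect display produced by one-byte software can be reproduced in a few lines of Python. This is only a sketch under assumptions that the text above does not specify: the file is taken to be stored in UTF-8, and the one-byte software is taken to interpret bytes as Latin-1.

```python
# Assumption: the file is stored in UTF-8, where "é" (U+00E9) occupies two bytes.
text = "é"
raw = text.encode("utf-8")

# Assumption: the one-byte software reads each byte as a Latin-1 character,
# so the two bytes come out as two unrelated characters.
misread = raw.decode("latin-1")

# Unicode-aware software decodes the same two bytes as the one intended character.
correct = raw.decode("utf-8")

print(len(raw), repr(misread), repr(correct))
```

The same two bytes thus yield either one character or two, depending only on what the software assumes about the file.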

Since only Unicode allows one to use (almost) all the characters and symbols of the world, a solution is needed for referring to Unicode characters in an ASCII file as long as one does not have files and software of type 1. This is essentially done by naming them. HTML (the field generating 99% of this demand) uses HTML entities for that purpose. They are names of characters. Some characters have their own (abbreviated) English name, e.g. &auml; for ä and &eacute; for é. However, since each Unicode character has its code, just like an ASCII or ANSI character, all characters can be referred to by spelling out their code. The code may be written as a decimal number of up to five digits or as a four-digit hexadecimal number. The file then contains the code of the character as ASCII text. For instance, in this HTML file, the character that appears as the right-pointing arrow (→) is represented by its hexadecimal code x2192, more precisely by the HTML entity &#x2192;. It is your browser that transforms this symbol sequence into a Unicode character. The characters and their codes are all enumerated in a set of PDF files available at the website of the Unicode Consortium.
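The conversion between a character and its ASCII-only HTML notation can be sketched in Python. The function names to_hex_reference and to_ascii_html are hypothetical helpers introduced here for illustration. The standard-library function html.unescape performs the step that the text attributes to the browser.

```python
import html

def to_hex_reference(ch: str) -> str:
    """Hypothetical helper: hexadecimal numeric reference for one character."""
    return f"&#x{ord(ch):X};"

def to_ascii_html(text: str) -> str:
    """Hypothetical helper: replace each non-ASCII character by a decimal reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

arrow = "\u2192"  # the right-pointing arrow discussed above

print(to_hex_reference(arrow))         # &#x2192;
print(to_ascii_html("a \u2192 b"))     # a &#8594; b

# Decoding, as a browser would: named and numeric forms both yield characters.
print(html.unescape("&#x2192;") == arrow)   # True
print(html.unescape("&auml;&eacute;"))      # äé
```

The decimal form 8594 and the hexadecimal form 2192 denote the same code. Which one to write in the file is a matter of convenience.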

If one has software that supports Unicode, one may use a character table that contains the characters coded in Unicode format and copy characters from the table into one's file. One such program is UniInput.

Another helpful tool for identifying Unicode characters is here.
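Where no such tool is at hand, a character can also be identified with Python's standard unicodedata module. The following is a minimal sketch, not a description of the tools mentioned above. The function name describe is a hypothetical helper.

```python
import unicodedata

def describe(ch: str) -> str:
    """Hypothetical helper: code point and official Unicode name of one character."""
    code = f"U+{ord(ch):04X}"
    name = unicodedata.name(ch, "<no name>")  # fallback for unnamed code points
    return f"{code} {name}"

# Identify some characters by code and name.
for ch in "\u2192\u00e4\u014b":
    print(describe(ch))

# The reverse direction: find a character by its official name.
print(unicodedata.lookup("RIGHTWARDS ARROW"))
```

For the arrow, describe returns "U+2192 RIGHTWARDS ARROW", which is the code given above together with the official character name.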

Some special characters and symbols frequently used by linguists are available on Lehmann's website.