Unicode for Language Documentation


Unicode and the International Phonetic Alphabet

Linguists who use the IPA, or who work with languages that currently have no orthography, can make use of Unicode characters. Using already encoded characters ensures interoperability with current applications, especially if glyphs for the characters are available in fonts that are already widely used (e.g., Arial Unicode MS).

Though most IPA symbols are contained in the IPA Extensions block, some come from other blocks (Latin and Greek, for example). These characters are listed at the beginning of the names list for the IPA Extensions block. A Phonetic Extensions block was added in Unicode 4.0, created primarily for the Uralic Phonetic Alphabet. Note that characters for linguistic transcription may also be composed from a base character and characters from the Spacing Modifier Letters block or the Combining Diacritical Marks block.
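As a minimal sketch of this kind of composition, the Python snippet below builds two transcription symbols from a base letter plus a character from another block, using only the standard unicodedata module. The particular symbols (a nasalized ɛ and an aspirated t) and the helper name describe are illustrative choices, not taken from the original presentation.

```python
import unicodedata

base = "\u025B"        # LATIN SMALL LETTER OPEN E (IPA Extensions block)
nasal = "\u0303"       # COMBINING TILDE (Combining Diacritical Marks block)
aspiration = "\u02B0"  # MODIFIER LETTER SMALL H (Spacing Modifier Letters block)

# Each symbol is stored as a sequence of two code points.
nasal_vowel = base + nasal       # renders as a nasalized open e
aspirated_t = "t" + aspiration   # renders as an aspirated t


def describe(text):
    """Return (code point, character name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


for symbol in (nasal_vowel, aspirated_t):
    print(symbol, describe(symbol))
```

A combining mark attaches to the preceding base character when rendered, whereas a spacing modifier letter occupies its own position; unicodedata.combining() returns a non-zero class for the former and 0 for the latter.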

The Unicode Character Names Index is also useful. It lists the formal character names, alternative character names, and character group names alphabetically.
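Formal character names can also be queried programmatically. The sketch below assumes Python's standard unicodedata module, which knows the formal names but not the index's group names; the helper find_by_name and the search range (the IPA Extensions block, U+0250 to U+02AF) are illustrative assumptions.

```python
import unicodedata


def find_by_name(keyword, start, end):
    """List (code point, name) pairs in [start, end] whose name contains keyword."""
    matches = []
    for cp in range(start, end + 1):
        # The second argument is returned for code points that have no name.
        name = unicodedata.name(chr(cp), "")
        if keyword.upper() in name:
            matches.append((f"U+{cp:04X}", name))
    return matches


# Exact lookup by formal character name.
schwa = unicodedata.lookup("LATIN SMALL LETTER SCHWA")

# Substring search within the IPA Extensions block.
schwa_like = find_by_name("schwa", 0x0250, 0x02AF)
print(schwa, schwa_like)
```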

Developing a Unicode Compliant Orthography

When developing an orthography, keep in mind these suggestions, which will enable your language to be used with current software, keyboards, and fonts:

Adding Characters to Unicode

The Unicode Standard offers a huge array of encoded characters that serve most linguists' needs. Because they are already in Unicode, which has been adopted by many software and font companies, they can be used in documents today. "Inventing" a new character is, however, not recommended, because such non-standard characters cause problems for short- and long-term accessibility (e.g., sending, receiving, and printing). Precomposed forms and ligatures are no longer eligible for encoding.

If you find you need a particular character that is not covered by Unicode, you are advised to work with the Script Encoding Initiative or directly with the Unicode Technical Committee to develop your proposal. Particularly helpful for proposals are copies of pages from books or journals that show a particular character in context (with the bibliographic information included). Though the full approval process can take several years, it will provide a means for others in the future to access the character in the international character encoding standard.

The Script Encoding Initiative, at the Department of Linguistics at the University of California at Berkeley, is dedicated to funding the development of script proposals. It aims to avoid extensive revision of the proposal, or extensive involvement of the Unicode Technical Committee.

More on the Script Encoding Initiative

General guidelines on how to submit a proposal can be found on the Unicode Consortium website.

Precomposed forms

A precomposed form is a single encoded character that could otherwise be represented as a sequence of characters. For example, kʷ encoded as one character would be a precomposed form; it can instead be represented as a k followed by a modifier letter, ʷ. In the early days of Unicode, dynamically generating forms from a base character and a combining mark was difficult or impossible, and many such precomposed forms were included in older character sets. As a result, a number of these precomposed forms were added to Unicode. However, current rendering engines and fonts are able to create base character and combining mark combinations dynamically. The position of the Unicode Technical Committee (UTC) is to rely on this productive method of composition and not to encode more precomposed forms.
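The relationship between an existing precomposed character and its equivalent sequence can be checked with Unicode normalization. The following is a minimal sketch using Python's standard unicodedata module; the choice of é as the example and the helper name same_text are assumptions made for illustration.

```python
import unicodedata

precomposed = "\u00E9"  # LATIN SMALL LETTER E WITH ACUTE (one code point)
sequence = "e\u0301"    # e + COMBINING ACUTE ACCENT (two code points)


def same_text(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(precomposed == sequence)           # raw code points differ
print(same_text(precomposed, sequence))  # canonically equivalent

# k + MODIFIER LETTER SMALL W has no precomposed counterpart,
# so normalization leaves it as a two-code-point sequence.
labialized_k = "k\u02B7"
print(len(unicodedata.normalize("NFC", labialized_k)))
```

Normalizing text before searching or comparing means that data entered with either the precomposed character or the sequence will match.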

Similarly, ligatures, which are two (or more) glyphs fused together, are also not eligible for character encoding. In general, ligatures can be handled by a font or rendering engine. Six digraph ligatures are included in the IPA Extensions block (U+02A3 to U+02A8). These have been included because they are defined in the IPA for the transcription of the coronal affricates and can be chosen by a transcriber in order to convey a semantic distinction about the phonetic status of the affricate.
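As a small illustration, the sketch below lists the six digraph characters by code point and name using Python's standard unicodedata module, and checks that one of them is not folded into its two-letter sequence by normalization. The variable names are illustrative.

```python
import unicodedata

# The six IPA digraph ligatures, U+02A3 through U+02A8.
digraphs = [chr(cp) for cp in range(0x02A3, 0x02A8 + 1)]

for ch in digraphs:
    print(f"U+{ord(ch):04X}", ch, unicodedata.name(ch))

# The ligature and the two-letter sequence remain distinct text,
# even under compatibility normalization (NFKC).
tesh_ligature = "\u02A7"    # LATIN SMALL LETTER TESH DIGRAPH
tesh_sequence = "t\u0283"   # t + LATIN SMALL LETTER ESH
distinct = unicodedata.normalize("NFKC", tesh_ligature) != tesh_sequence
print(distinct)
```

Because the two spellings do not normalize to each other, a search for the sequence will not find the ligature, so a transcription project may want to settle on one convention.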

For more on precomposed forms, see the FAQ on Ligatures, Digraphs, and Presentation on the Unicode Consortium website, at: http://www.unicode.org/faq/ligature_digraph.html.

The content of this page was developed from Deborah Anderson's presentation at the 2003 E-MELD workshop.
