Digitizing Text
Introduction
In addition to audio, video, and image files, language documentation often includes text files, in the form of word lists, lexicons, interlinear glossed text, annotations to audio and video, or field notes. Indeed, the text is arguably the most important part of the language documentation, since it may hold the key to the interpretation of the sounds and images. So it is important to make sure that text is preserved as carefully as information in other media. For text, as for other media, it is also important to distinguish between the archival format, the working format, and the various presentation formats which can be created from the archival copy.
Archival formats
The archival copy of textual content should employ the file format least vulnerable to software and hardware obsolescence, and should incorporate fully defined or documented terminology. Virtually every text-handling program can read a plain text (.txt) file; therefore best practice is to make an archival copy in .txt format. Documents created with word processing applications can simply be saved as plain text. Spreadsheets can be saved as Text (Tab delimited); CSV (Comma delimited) is also acceptable. Similarly, many database applications make it possible to save data as plain, tab-delimited text. Regularly saving a file in a plain text format helps ensure that it remains machine-readable over the long term.
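As a rough illustration of the tab-delimited export described above, the following Python sketch serializes a small word list as plain text. The column names, the sample rows, and the helper names `to_tab_delimited` and `save_archival_copy` are assumptions made for this example; they are not part of any recommended schema.

```python
import csv
import io

# Hypothetical word-list rows; the column names are illustrative only.
rows = [
    {"headword": "kitab", "pos": "noun", "gloss": "book"},
    {"headword": "kataba", "pos": "verb", "gloss": "he wrote"},
]


def to_tab_delimited(records, fieldnames):
    """Serialize records as tab-delimited text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def save_archival_copy(text, path):
    """Write the text to disk as UTF-8 plain text."""
    # newline="" stops Python from rewriting the "\n" line endings.
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(text)


text = to_tab_delimited(rows, ["headword", "pos", "gloss"])
```

Calling `save_archival_copy(text, "wordlist.txt")` would then write the archival copy; the file name is likewise only an example.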
However, humans as well as computers must be able to understand the text. Ideally, structural and semantic information about the content should be provided via XML markup. XML is a tag-based language similar to HTML, except that the tags describe content rather than formatting, e.g., the HTML tag <b> indicates bold whereas an XML tag such as <headword> describes the content of the material.
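To make the contrast with formatting tags concrete, here is a minimal Python sketch that builds one lexical entry with descriptive tags and then retrieves a field by name. The element names `entry`, `pos`, and `gloss` are assumptions chosen for illustration (only `headword` appears in the text above), and this is not a standard lexicon schema.

```python
import xml.etree.ElementTree as ET

# Element names are illustrative, not a standard schema.
entry = ET.Element("entry")
ET.SubElement(entry, "headword").text = "kitab"
ET.SubElement(entry, "pos").text = "noun"
ET.SubElement(entry, "gloss").text = "book"

# Serialize the entry as a Unicode string of XML markup.
xml_text = ET.tostring(entry, encoding="unicode")

# Because the tags describe content, a program can ask for
# the headword by name instead of guessing from formatting.
parsed = ET.fromstring(xml_text)
headword = parsed.findtext("headword")
```

Here `xml_text` holds the marked-up entry, and `headword` is recovered from it purely by tag name, which would not be possible if the word had only been marked as bold.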
Character encoding
The interpretation of characters in a text file should also be transparent. Unfortunately, many older character encoding standards are inadequate to the task of providing an unambiguous encoding for each of the characters needed by the linguist. For example, US-ASCII is a 7-bit standard with only 128 code points, and even an 8-bit standard such as ISO 8859-1 (Latin-1) has only 256 code points and thus can unambiguously encode only 256 characters. This is adequate only for documenting a language whose orthography needs no characters outside that small set. The Unicode encoding standard, however, aims to provide a unique code point for (ultimately) every character in the world's languages. Thus Unicode is unambiguous, and is the encoding standard universally recommended for material whose long-term intelligibility is critical.
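The ambiguity of 8-bit encodings can be seen in a short Python sketch: the same byte decodes to different characters under two legacy code pages, whereas each Unicode character has its own code point and name. The choice of Latin-1 and Windows-1251 as the two code pages, and the helper names `describe` and `fits_in_ascii`, are assumptions made for this example.

```python
import unicodedata

# One byte, two readings, depending on which 8-bit encoding is assumed.
raw = bytes([0xE6])
as_latin1 = raw.decode("latin-1")  # Western European code page
as_cp1251 = raw.decode("cp1251")   # Cyrillic code page


def describe(char):
    """Return the Unicode code point and official name of a character."""
    return f"U+{ord(char):04X} {unicodedata.name(char)}"


def fits_in_ascii(text):
    """Report whether every character in the text is within US-ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


print(describe(as_latin1))
print(describe(as_cp1251))
print(fits_in_ascii("\u014b"))  # the velar nasal symbol, outside ASCII
```

Saving text as UTF-8 avoids this problem in practice, because the bytes then map back to exactly one sequence of Unicode code points.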
Terminology
Any linguistic terminology used in the text file should also be defined or systematically related to a well-documented terminology set. Currently, there are many different terminology sets in common use. It is not unusual to find similar structures in different languages described in different terms (e.g., verbal morphemes which mark agreement with an argument have been variously called "pronominal affixes," "person markers," "bound pronouns," and "subject markers"). In rarer cases the opposite problem arises: the same term may be used to describe different language structures (as when "absolute" is used to indicate a non-possessed form in Semitic, but a transitive object/intransitive subject in ergative languages). For this reason, best practice is to define the terminology within the document or, for greater interoperability, to map the terminology to an ontology of linguistic concepts.
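One way to picture such a mapping is a small lookup table that sends each project-specific term to a shared concept label. The Python sketch below is only an illustration: the concept labels (`agreement-marker`, `non-possessed-form`, `absolutive-case`) and the function name `resolve` are hypothetical and are not taken from any published ontology.

```python
# Different terms for the same structure map to one concept label.
# All concept labels here are hypothetical placeholders.
SYNONYMS = {
    "pronominal affix": "agreement-marker",
    "person marker": "agreement-marker",
    "bound pronoun": "agreement-marker",
    "subject marker": "agreement-marker",
}

# One term used for different structures needs its descriptive tradition.
AMBIGUOUS = {
    "absolute": {
        "semitic": "non-possessed-form",
        "ergative": "absolutive-case",
    },
}


def resolve(term, tradition=None):
    """Map a term (and, if needed, its tradition) to a concept label."""
    key = term.lower()
    if key in AMBIGUOUS:
        readings = AMBIGUOUS[key]
        if tradition is None or tradition.lower() not in readings:
            raise ValueError(f"'{term}' is ambiguous; specify a tradition")
        return readings[tradition.lower()]
    return SYNONYMS[key]  # raises KeyError for undocumented terms
```

With this table, `resolve("person marker")` and `resolve("bound pronoun")` return the same label, while `resolve("absolute")` fails until the tradition is stated, which is the kind of explicitness the mapping is meant to enforce.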
Working format
The working format of a document is the file format which it has while the material is being processed or manipulated by its creator. For example, many linguists use database management software to produce and manipulate a working copy of lexical information. As long as an archival copy is created for long-term preservation, the working format of the material can be any form that the linguist finds convenient. However, linguists should be aware that much of the software used to manipulate data in a working format is proprietary and prone to rapid obsolescence. Current versions of Microsoft Word, for example, cannot read documents created in Word 1.0; proprietary database management programs have a similarly short life. Software vendors do not always make new versions backward-compatible, so any document created using proprietary software is at risk. Therefore, it is highly important to make archival copies of the data at regular intervals.
Presentation format
One benefit of making an archival copy with XML markup is that XML files are easily transformed into a variety of presentation formats via XSL stylesheets. Using stylesheets it is possible to transform an XML archive file into formats suitable for presentation on the web and in print.
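The idea of deriving several presentation formats from one archival file can be sketched in Python. Note that this example does not use XSL: the Python standard library has no XSLT processor, so plain `xml.etree.ElementTree` code stands in for the stylesheet. The element names and the function names `to_html` and `to_plain_text` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# A tiny archival file; the element names are illustrative only.
ARCHIVE_XML = (
    "<lexicon><entry><headword>kitab</headword>"
    "<pos>noun</pos><gloss>book</gloss></entry></lexicon>"
)


def to_html(xml_text):
    """Render each entry as an HTML definition-list item for the web."""
    lexicon = ET.fromstring(xml_text)
    dl = ET.Element("dl")
    for entry in lexicon.iter("entry"):
        dt = ET.SubElement(dl, "dt")
        # Formatting (bold) is decided here, not in the archive.
        ET.SubElement(dt, "b").text = entry.findtext("headword")
        dd = ET.SubElement(dl, "dd")
        dd.text = f"({entry.findtext('pos')}) {entry.findtext('gloss')}"
    return ET.tostring(dl, encoding="unicode")


def to_plain_text(xml_text):
    """Render each entry as one tab-separated line for print layout."""
    lexicon = ET.fromstring(xml_text)
    return "\n".join(
        "\t".join(entry.findtext(tag) for tag in ("headword", "pos", "gloss"))
        for entry in lexicon.iter("entry")
    )


html = to_html(ARCHIVE_XML)
plain = to_plain_text(ARCHIVE_XML)
```

Both outputs come from the same archive, so a change to the presentation never requires touching the archival copy. An actual XSL stylesheet would express the same mapping declaratively.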
It is also possible to use a form of stylesheet with word processing programs. This will provide consistency in presentation and make it easier to convert the document to another system or format.