Digitizing Text

Introduction

In addition to audio, video, and image files, language documentation often includes text files, in the form of word lists, lexicons, interlinear glossed text, annotations to audio and video, or field notes. Indeed, the text is arguably the most important part of the language documentation, since it may hold the key to the interpretation of the sounds and images. So it is important to make sure that text is preserved as carefully as information in other media. For text, as for other media, it is also important to distinguish between the archival format, the working format, and the various presentation formats which can be created from the archival copy.

Archival formats

The archival copy of textual content should employ the file format least vulnerable to software and hardware obsolescence, and should incorporate fully defined or documented terminology. Virtually every text-handling program can read a plain text (.txt) file; best practice is therefore to make an archival copy in .txt format. Documents created with word processing applications can simply be saved as plain text. Spreadsheets can be saved as Text (Tab delimited); CSV (Comma delimited) is also acceptable. Similarly, many database applications make it possible to save data as plain, tab-delimited text. Regularly saving a file in a plain text format helps ensure that it remains machine-readable over the long term.
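As a rough illustration of the tab-delimited option, the following Python sketch writes a small word list as plain text. The sample rows, the column names (headword, gloss, pos), and the helper names are assumptions made for this example; they are not prescribed by any archive.

```python
import csv
import io

# Hypothetical word-list rows; the column names are illustrative only.
ROWS = [
    {"headword": "kitabu", "gloss": "book", "pos": "noun"},
    {"headword": "soma", "gloss": "read", "pos": "verb"},
]


def to_tab_delimited(records, fieldnames):
    """Return the records as tab-delimited plain text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def save_archival_copy(path, text):
    """Write the text to disk, stating the encoding explicitly."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(text)


text = to_tab_delimited(ROWS, ["headword", "gloss", "pos"])
```

Calling save_archival_copy("wordlist.txt", text) would then produce a file that any text editor can open. Stating the encoding when the file is written avoids depending on the platform default.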

More on archiving linguistic databases

However, humans as well as computers must be able to understand the text. Ideally, structural and semantic information about the content should be provided via XML markup. XML is a tag-based language similar to HTML, except that the tags describe content rather than formatting, e.g., the HTML tag <b> indicates bold whereas an XML tag such as <headword> describes the content of the material.
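To make the idea of descriptive markup concrete, here is a minimal Python sketch that builds one lexical entry with the standard xml.etree.ElementTree module. Only the headword tag comes from the text above; the entry, pos, and gloss element names and the sample data are assumptions for illustration, not a recommended schema.

```python
import xml.etree.ElementTree as ET


def make_entry(headword, pos, gloss):
    """Build an <entry> element whose child tags name the content they hold."""
    entry = ET.Element("entry")  # hypothetical wrapper element
    ET.SubElement(entry, "headword").text = headword
    ET.SubElement(entry, "pos").text = pos  # hypothetical: part of speech
    ET.SubElement(entry, "gloss").text = gloss  # hypothetical: translation
    return entry


entry = make_entry("kitabu", "noun", "book")

# Serialize to a string; encoding="unicode" returns str rather than bytes.
xml_text = ET.tostring(entry, encoding="unicode")
```

Because each tag says what its content is, a later reader or program can pick out the headword without guessing from layout or typography.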

More on XML

Australian Partnership for Sustainable Repositories Working Papers

Character encoding

The interpretation of characters in a text file should also be transparent. Unfortunately, many of the older character encoding standards still in use are inadequate to the task of providing an unambiguous encoding for each of the characters a linguist may need. For example, US-ASCII is a 7-bit standard with only 128 code points, and even an 8-bit standard such as ISO 8859-1 (Latin-1) has only 256 code points and thus can unambiguously encode only 256 characters. This is adequate for documenting a language whose orthography needs no characters beyond that set. The Unicode encoding standard, however, provides a unique code point for (ultimately) every character in the world's writing systems. Thus Unicode is unambiguous, and is the encoding standard universally recommended for material whose long-term intelligibility is critical.
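The difference can be checked directly. The sketch below, which uses only the Python standard library, looks up the Unicode code point of the letter eng (ŋ), a character common in linguistic orthographies, and tests which encodings can represent it. The helper names describe and fits_in are made up for this example.

```python
import unicodedata


def describe(char):
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(char):04X} {unicodedata.name(char)}"


def fits_in(char, encoding):
    """Report whether the character can be represented in the given encoding."""
    try:
        char.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


eng = "\u014b"  # the letter eng, written as an escape so the source stays ASCII

label = describe(eng)
in_ascii = fits_in(eng, "ascii")
in_latin1 = fits_in(eng, "latin-1")
utf8_bytes = eng.encode("utf-8")
```

Here in_ascii and in_latin1 are both False, while the UTF-8 encoding of Unicode stores the character without loss as a two-byte sequence.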

More on Unicode

Terminology

Any linguistic terminology used in the text file should also be defined or systematically related to a well-documented terminology set. Currently, there are many different terminology sets in common use. It is not unusual to find similar structures in different languages described in different terms (e.g., verbal morphemes which mark agreement with an argument have been variously called "pronominal affixes," "person markers," "bound pronouns," and "subject markers"). In rarer cases the opposite problem obtains, i.e., the same term may be used to describe different language structures (as when "absolute" is used to indicate a non-possessed form in Semitic, but a transitive object/intransitive subject in ergative languages). For this reason, best practice is to define the terminology within the document or, for greater interoperability, to map the terminology to an ontology of linguistic concepts.
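One simple way to record such a mapping is a lookup table kept alongside the data. The Python sketch below is only an illustration: the concept labels (for example "concept:AgreementMarker") are hypothetical placeholders, not identifiers from GOLD or any other real ontology, and the tradition names are informal tags chosen for this example.

```python
# Different local terms that describe the same structure map to one concept.
# All concept labels are hypothetical placeholders, not real ontology IDs.
SYNONYMS = {
    "pronominal affix": "concept:AgreementMarker",
    "person marker": "concept:AgreementMarker",
    "bound pronoun": "concept:AgreementMarker",
    "subject marker": "concept:AgreementMarker",
}

# A term that means different things in different descriptive traditions
# is keyed by (term, tradition) so the ambiguity is resolved explicitly.
AMBIGUOUS = {
    ("absolute", "semitic"): "concept:NonPossessedForm",
    ("absolute", "ergative"): "concept:ObjectOrIntransitiveSubject",
}


def to_concept(term, tradition=None):
    """Return the shared concept label for a local term, or None if unmapped."""
    key = term.strip().lower()
    if tradition is not None:
        match = AMBIGUOUS.get((key, tradition.strip().lower()))
        if match is not None:
            return match
    return SYNONYMS.get(key)
```

For instance, to_concept("Person marker") and to_concept("bound pronoun") return the same label, while to_concept("absolute") returns None unless a tradition is supplied, which flags the term as needing disambiguation.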

More on GOLD

Working format

The working format of a document is the file format which it has while the material is being processed or manipulated by its creator. For example, many linguists use database management software to produce and manipulate a working copy of lexical information. As long as an archival copy is created for long-term preservation, the working format of the material can be any form that the linguist finds convenient; however, linguists should be aware that most of the software used to manipulate data in a working format is proprietary and prone to rapid obsolescence. Current versions of Microsoft Word, for example, cannot read documents created in Word 1.0; proprietary database management programs have a similarly short life. When software vendors create new versions, backward compatibility is rarely a priority. Any document created using proprietary software is therefore at risk, and it is highly important to make archival copies of the data at regular intervals.

Presentation format

One benefit of making an archival copy with XML markup is that XML files are easily transformed into a variety of presentation formats via XSL stylesheets. Using stylesheets it is possible to transform an XML archive file into formats suitable for presentation on the web and in print.
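The principle of deriving several presentations from one archival file can be sketched in a few lines. The Python standard library does not include an XSLT processor, so the example below swaps in ordinary Python functions in place of XSL stylesheets; it illustrates the idea and is not an XSL implementation. The lexicon, entry, and gloss element names and the sample data are assumptions for this example.

```python
import xml.etree.ElementTree as ET
from html import escape

# A tiny, hypothetical archival file with descriptive markup.
ARCHIVE_XML = (
    "<lexicon>"
    "<entry><headword>kitabu</headword><gloss>book</gloss></entry>"
    "<entry><headword>soma</headword><gloss>read</gloss></entry>"
    "</lexicon>"
)


def entries(xml_text):
    """Yield (headword, gloss) pairs from the archival XML."""
    root = ET.fromstring(xml_text)
    for entry in root.iter("entry"):
        yield entry.findtext("headword", default=""), entry.findtext("gloss", default="")


def to_html(xml_text):
    """Web presentation: an HTML list with the headword in bold."""
    items = [
        "<li><b>" + escape(head) + "</b>: " + escape(gloss) + "</li>"
        for head, gloss in entries(xml_text)
    ]
    return "<ul>" + "".join(items) + "</ul>"


def to_plain(xml_text):
    """Print-style presentation: one 'headword - gloss' line per entry."""
    return "\n".join(head + " - " + gloss for head, gloss in entries(xml_text))


html_view = to_html(ARCHIVE_XML)
plain_view = to_plain(ARCHIVE_XML)
```

Both views are generated from the same source, so a correction made in the archival file carries over to every presentation the next time it is regenerated.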

More on XSL stylesheets

It is also possible to use a form of stylesheet with word processing programs. This will provide consistency in presentation and make it easier to convert the document to another system or format.

More on using stylesheets for word processing
