Creating an Archive-Ready Corpus

When creating a corpus, materials should be stored and organized with future generations in mind. This page discusses what to consider when naming and labeling materials and when choosing storage formats. A list of ways to find exemplary archival materials is also provided.

Here are the steps you can take to make sure that your corpus will be ready for the archives:

Naming Materials

A convention for naming materials must be established as they are created; this convention must consider how they will be sorted into archival objects. The convention used must be infinitely extensible and must be applied consistently. Keep the following points in mind:

Labeling Materials

It is very important that all field materials be clearly labeled so that they will be accessible for future researchers. Every tape, disk, notebook, diskette, and shoebox must be clearly labeled. Along with the unique identifier, you might add other pertinent information, such as the date, the name of the consultant, and even the name of the language.

Each item should be entered into a database (a spreadsheet will do; even a word processing document is better than nothing). Include the name of the item, the format, dates, and any other data that seems useful. It may also be helpful to print out a paper copy of the database periodically. Paper is an old-fashioned form of back-up and lacks the convenient search capabilities of electronic records, but it has the advantage of being immune to system crashes.
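As a rough illustration of such a database, the sketch below writes an inventory to CSV, a plain-text format that any spreadsheet program can open. The column names, the identifiers, and the helper functions are hypothetical and should be adapted to your own naming convention; the duplicate check is only one simple safeguard.

```python
import csv
import io

# Hypothetical column names; adapt them to your own project.
FIELDS = ["identifier", "format", "date", "consultant", "language", "notes"]


def write_inventory(items, stream):
    """Write one row per field item to a CSV stream, header first."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    for item in items:
        writer.writerow(item)


def find_duplicate_ids(items):
    """Return identifiers that occur more than once, in first-seen order."""
    seen, duplicates = set(), []
    for item in items:
        ident = item["identifier"]
        if ident in seen and ident not in duplicates:
            duplicates.append(ident)
        seen.add(ident)
    return duplicates


# Invented sample records, for illustration only.
items = [
    {"identifier": "XYZ-2004-001", "format": "audio cassette",
     "date": "2004-06-12", "consultant": "Consultant A",
     "language": "Language X", "notes": "word list"},
    {"identifier": "XYZ-2004-002", "format": "notebook",
     "date": "2004-06-13", "consultant": "Consultant A",
     "language": "Language X", "notes": "transcription of 001"},
]

if not find_duplicate_ids(items):
    buffer = io.StringIO()  # use open("inventory.csv", "w", newline="") for a real file
    write_inventory(items, buffer)
    print(buffer.getvalue())
```

Printing the resulting file periodically gives you the paper copy described above.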


A clear distinction must be made between the formats in which materials may be kept. A master, or archival format, is best for long-term preservation. A presentation format (also called display or access format) is best for presenting documentation on the web. A working format is whichever you find most convenient to use. For example, lexical data might be entered in a spreadsheet, presented online in HTML, and preserved in XML. Digital materials in presentation formats are not archive-quality materials. Presentation and working formats will be in wide circulation; they are the formats that people will be most familiar with. The archival format, however, is the one that matters to future users and to the discipline as a whole.
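To make the lexical data example concrete, here is a minimal sketch of moving rows from a working format (a spreadsheet exported as CSV) into XML. The element names lexicon and entry, the column names, and the sample rows are assumptions made for illustration; they do not represent a recommended schema, and the column headers must be valid XML element names for this to work.

```python
import csv
import io
import xml.etree.ElementTree as ET


def lexicon_to_xml(csv_text):
    """Convert spreadsheet-style CSV rows into a simple XML tree.

    Each row becomes an <entry>; each column becomes a child element
    named after the column header.
    """
    root = ET.Element("lexicon")  # hypothetical root element name
    for row in csv.DictReader(io.StringIO(csv_text)):
        entry = ET.SubElement(root, "entry")
        for column, value in row.items():
            ET.SubElement(entry, column).text = value
    return root


# Illustrative working-format data: a header row and two entries.
sample = "headword,gloss\nkata,word\nair,water\n"

root = lexicon_to_xml(sample)
print(ET.tostring(root, encoding="unicode"))
```

The working copy stays in the spreadsheet for day-to-day editing, and the XML output is regenerated from it whenever an archival copy is needed.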

Archive quality formats are:


In preparing archival material, it can be helpful to use an appropriate example; that is, an archive that covers a similar linguistic or geographical area. Good places to find such examples are:

Finding Help with Digitization

Your institution's library is a good place to go for help with the processes of digitization. Librarians and archivists are increasingly aware of the issues involved in preservation of electronic materials; they frequently have access to the necessary equipment and are generally experienced in its use.

However, it is important to remember that they are (usually) not linguists and may not be aware of the standards needed for preservation of endangered language documentation. In some cases, they are more accustomed to preparing materials for presentation on the web than for long-term storage. For example, they might recommend storing audio in MP3 format, assuming that this will be good enough for your needs, when in fact audio needs to be digitized at a minimum bit depth of 16 and a sampling rate of 44.1 or 48 kHz, and stored in uncompressed WAV format. Information professionals have knowledge that can be very useful to you in digitizing and preserving your materials, but it is your responsibility to make sure that linguistic standards are met in this process.
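One way to verify files you receive is to read the WAV header and compare it with the minimums stated above. The sketch below uses Python's standard wave module; the function name and the decision to treat any rate of 44.1 kHz or higher as acceptable are assumptions. It inspects only the header, so it cannot tell whether the audio was converted from a compressed source such as MP3 before being saved as WAV.

```python
import wave

MIN_SAMPLE_WIDTH_BYTES = 2   # 2 bytes per sample = 16-bit depth
MIN_FRAME_RATE_HZ = 44100    # assumption: 44.1 kHz and above pass


def check_wav(path_or_file):
    """Return a list of problems found in a WAV header (empty if none).

    wave.open raises wave.Error if the file is not a WAV file that the
    module can read, for example an MP3.
    """
    problems = []
    with wave.open(path_or_file, "rb") as wav:
        bits = wav.getsampwidth() * 8
        rate = wav.getframerate()
        if wav.getsampwidth() < MIN_SAMPLE_WIDTH_BYTES:
            problems.append(f"bit depth is {bits}, expected at least 16")
        if rate < MIN_FRAME_RATE_HZ:
            problems.append(f"sampling rate is {rate} Hz, expected at least 44100")
    return problems
```

Calling check_wav with a file path (the path is whatever your own naming convention produces) returns an empty list when both minimums are met.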

The content of this page was developed following the recommendations of the E-MELD working group on Archiving.
