Creating an Archive-Ready Corpus
When creating a corpus, materials should be stored and organized with future generations in mind. This page discusses considerations that should be made in naming and labeling materials and in choosing storage formats. A list of ways to find exemplary archival materials is also provided.
Here are the steps you can take to make sure that your corpus will be ready for the archives:
Establish a convention for naming materials as they are created, and design it with an eye to how the materials will later be sorted into archival objects. The convention must be open-ended, so that it never runs out of names, and it must be applied consistently. Keep the following points in mind:
- Audio recordings should, for example, be labeled by tape number, side letter and item number (rec1A1), or by minidisk and track number, or by speaker initials (GSM1).
- A track marker should clearly delineate separate tracks, e.g. an introduction in the contact language, or the clunk sound made by turning the recorder off between recordings. This will help digitizers, who may not be familiar with the language, separate items on a single tape, CD, or minidisk.
- Try to use similar names for related materials in different formats. For example, if a tape is labeled GSM1, you might name the transcript GSM1 or GSM1T.
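The labeling scheme above can be sketched in a few lines of Python. This is only an illustration, assuming the pattern suggested by the example rec1A1 (the prefix "rec", a tape number, a side letter, and an item number) and the GSM1/GSM1T pairing for transcripts. The function names are hypothetical, and your own convention may well differ.

```python
import re
import string

# Assumed pattern, modeled on the example "rec1A1":
# "rec" + tape number + side letter + item number.
TAPE_ID = re.compile(r"^rec(?P<tape>\d+)(?P<side>[A-Z])(?P<item>\d+)$")


def make_tape_id(tape: int, side: str, item: int) -> str:
    """Build an identifier such as rec1A1 from its parts."""
    if tape < 1 or item < 1:
        raise ValueError("tape and item numbers start at 1")
    if len(side) != 1 or side.upper() not in string.ascii_uppercase:
        raise ValueError("side must be a single letter A-Z")
    # Numbers are not zero-padded, so the scheme never runs out of names.
    return f"rec{tape}{side.upper()}{item}"


def parse_tape_id(identifier: str) -> dict:
    """Split an identifier into tape, side, and item, or raise ValueError."""
    match = TAPE_ID.match(identifier)
    if match is None:
        raise ValueError(f"not a valid tape identifier: {identifier!r}")
    return {
        "tape": int(match["tape"]),
        "side": match["side"],
        "item": int(match["item"]),
    }


def transcript_name(recording_id: str) -> str:
    """Name a transcript after its recording by appending T (GSM1 -> GSM1T)."""
    return recording_id + "T"
```

Generating names with a function, rather than typing them by hand, is one way to keep the convention consistent; parsing them back is a quick check that existing labels actually follow it.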
It is very important that all field materials be clearly labeled so that they will be accessible for future researchers. Every tape, disk, notebook, diskette, and shoebox must be clearly labeled. Along with the unique identifier, you might add other pertinent information, such as the date, the name of the consultant, and even the name of the language.
Each item should be entered into a database (a spreadsheet will do; even a word processing document is better than nothing). Include the name of the item, the format, dates, and any other data that seems useful. It may also be helpful to print out a paper copy of the database periodically. Paper is an old-fashioned form of back-up and lacks the convenient search capabilities of electronic records, but it has the advantage of being immune to system crashes.
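As a minimal sketch of such a database, the following Python writes an inventory as a CSV file, which any spreadsheet program can open and print. The column names and the helper functions are assumptions made for illustration, not a prescribed schema.

```python
import csv

# Hypothetical column set; add whatever other data seems useful.
FIELDS = ["identifier", "format", "date", "consultant", "language"]


def write_catalog(items, stream):
    """Write one row per item to an open text stream, with a header row."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    for item in items:
        # Missing fields are written as empty cells.
        writer.writerow(item)


def duplicate_identifiers(items):
    """Return identifiers that occur more than once, in order of discovery."""
    seen, duplicates = set(), []
    for item in items:
        name = item["identifier"]
        if name in seen and name not in duplicates:
            duplicates.append(name)
        seen.add(name)
    return duplicates
```

A typical use would be to open the file with `open("catalog.csv", "w", newline="", encoding="utf-8")` and pass it as `stream`. Running `duplicate_identifiers` before writing helps confirm that every identifier is in fact unique.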
Distinguish clearly between the formats in which materials may be kept. A master, or archival, format is best for long-term preservation. A presentation format (also called display or access format) is best for presenting documentation on the web. A working format is whichever you find most convenient to use. For example, lexical data might be entered in a spreadsheet, presented online in HTML, and preserved in XML. Digital materials in presentation formats are not archive-quality materials. Presentation and working formats will be in wide circulation, and they are the formats that people will be most familiar with. The archival format is the one that matters to future users and to the discipline as a whole.
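To illustrate the step from a working format to an archival one, here is a rough Python sketch that converts spreadsheet data exported as CSV into XML. The element names ("lexicon", "entry") are hypothetical and do not follow any particular archive's schema, and the sketch assumes that each column header is a valid XML element name (no spaces, not starting with a digit).

```python
import csv
import io
import xml.etree.ElementTree as ET


def csv_to_xml(csv_text: str) -> str:
    """Convert CSV text with a header row into a simple XML document."""
    reader = csv.DictReader(io.StringIO(csv_text))
    root = ET.Element("lexicon")  # hypothetical root element
    for row in reader:
        entry = ET.SubElement(root, "entry")  # one element per spreadsheet row
        for column, value in row.items():
            # Each column header becomes a child element holding the cell text.
            ET.SubElement(entry, column).text = value
    return ET.tostring(root, encoding="unicode")
```

Called on CSV text with the headers `headword` and `gloss`, this produces one `entry` element per row, each containing a `headword` and a `gloss` element. A real archival format would normally follow a documented schema agreed with the receiving archive.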
Archive quality formats are:
- Non-proprietary; that is, the specification of the format is openly accessible to the public;
- Supported by good software tools from multiple vendors;
- Best possible reproductions of the original.
In preparing archival material, it can be helpful to use an appropriate example; that is, an archive that covers a similar linguistic or geographical area. Good places to find such examples are:
- OLAC Archives;
- Relevant publications (e.g. SSILA newsletter);
- Other researchers in the same area (e.g. LINGUIST Person Directory);
- Funding agencies (DOBES, HRELP).
Your institution's library is a good place to go for help with the processes of digitization. Librarians and archivists are increasingly aware of the issues involved in preservation of electronic materials; they frequently have access to the necessary equipment and are generally experienced in its use.
However, it is important to remember that they are (usually) not linguists and may not be aware of the standards needed for preservation of endangered language documentation. In some cases, they are more accustomed to preparing materials for presentation on the web than for long-term storage. For example, they might recommend storing audio in MP3 format, assuming that this will be good enough for your needs. In fact, audio needs to be digitized, at a minimum, at a bit depth of 16 and a sampling rate of 44.1 or 48 kHz, and stored in uncompressed WAV format. Information professionals have knowledge that can be very useful to you in digitizing and preserving your materials, but it is your responsibility to make sure that linguistic standards are met in this process.
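One way to confirm that a digitized file meets these minimums is to read its header with Python's standard `wave` module. The sketch below is a simple check under stated assumptions, not a full validation tool: the function name is hypothetical, and the thresholds are the 16-bit and 44.1 kHz minimums given above.

```python
import wave

MIN_BIT_DEPTH = 16        # bits per sample
MIN_SAMPLE_RATE = 44100   # Hz (44.1 kHz; 48 kHz also passes)


def check_wav(source):
    """Return a list of problems found in a WAV file (empty if none).

    `source` may be a file path or an open binary file object.
    Files that are not uncompressed PCM WAV (for example MP3) make
    wave.open raise wave.Error.
    """
    with wave.open(source, "rb") as wav:
        bit_depth = wav.getsampwidth() * 8  # sample width is given in bytes
        sample_rate = wav.getframerate()
    problems = []
    if bit_depth < MIN_BIT_DEPTH:
        problems.append(f"bit depth is {bit_depth}, below {MIN_BIT_DEPTH}")
    if sample_rate < MIN_SAMPLE_RATE:
        problems.append(f"sampling rate is {sample_rate} Hz, below {MIN_SAMPLE_RATE} Hz")
    return problems
```

Note that this only inspects how the file is stored. A file converted from MP3 back to WAV will pass the check even though the information lost in compression cannot be recovered, so the check complements, rather than replaces, a record of how each file was digitized.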