An expanded version of a paper originally presented at the annual meeting of the:
      Society for the Study of the Indigenous Languages of the Americas
     2-5 January 2003, Atlanta, Georgia

 
 

Sáliba wordlist project: A case study in best practices for archival documentation of an endangered language

Paul Frank and Gary Simons
SIL International


1. The Problem

Over the past 50 years significant strides have been made in the documentation of the indigenous languages of the Americas. Much of this documentation remains unpublished and is, therefore, inaccessible to others. Some of that documentation (such as audio tape recordings) will eventually become lost due to physical deterioration. Materials can be easily converted into digital form to increase accessibility, but, as Bird and Simons (2003:557) observe, unless steps are taken to ensure its longevity, “much digital language documentation and description becomes inaccessible within a decade of its creation”. They go on to recommend a number of best practices for digital language documentation but note that the recommendations are preliminary and need to be fleshed out by the language resources community. Other sources of best practice recommendations we have consulted are Plichta and Kornbluh (2002) for digitizing audio recordings and MATRIX (2001) for digital imaging.

Sáliba is an endangered language of the Sáliba-Piaroa family spoken by some 3000 people living in the eastern plains of Colombia and in Venezuela. Those over 60 use Sáliba almost exclusively and know little Spanish. Those between 30 and 60 years of age are bilingual in Sáliba and Spanish. Spanish is the mother tongue of those under 30; some of them understand a little Sáliba and know some words and phrases, but they do not use the language regularly. [1]

Given the anticipated loss of Sáliba within the next decades due to language shift, it is important to preserve and provide access to the documentation that exists on this language. That documentation includes a standardized wordlist gathered by SIL linguist Nancy Morse in 1996. The field transcription and an accompanying cassette tape are currently in the files of SIL.

In this paper we treat the Sáliba wordlist as a test case for the best practice recommendations cited above concerning archival documentation of language resources. We report the results of our project to prepare these materials for long-term archiving and to provide present-day access on the Internet via appropriate presentation forms. After an overview of our approach in section 2, section 3 enumerates the best practice recommendations we followed, and section 4 describes the processes we used to convert the original materials into the archival and presentation forms. Before we offer concluding remarks, section 5 gives a cost analysis that could be helpful to others planning a similar project.

2. Proposed Solution

The Sáliba wordlist materials collected in the field comprise two items: a 15-page wordlist form and a 40-minute recording on audio cassette. The form presents a standardized wordlist of 375 items based on the Swadesh-Rowe list (reference??). For each item the form provides a prompt in Spanish, an English translation, and a blank for the transcription of the elicited form. In this case, the blanks are filled in with handwritten Americanist phonetic transcriptions made by the linguist as each item was elicited. Some items include alternate pronunciations or additional notes on the gloss. The accompanying audio cassette contains a recording of two Sáliba men repeating the list after the elicitation was completed, with Saúl Humejé reading the Spanish prompts and Angel Eduardo Humejé speaking the Sáliba words. It is a one-track recording using just the right channel of a stereo cassette. The recording was made with a portable <model ???> recorder. <??? something about microphone?>

As Simons (2004) observes, a linguist must do two things to ensure that language documentation will persist far into the future. First, in order to ensure that the materials will still be readable from the software point of view, they must be put into a file format that software of the future will still be able to interpret. Second, in order to ensure that the materials will still be readable from the hardware point of view, they must be deposited with an archive that will ensure that they are migrated as needed to fresh media lest they perish on media that become obsolete or unreadable.

In order to meet the first objective of putting the material into a format appropriate for long-term archiving, we set out to do the following:

  • digitize the cassette recording as an uncompressed WAV file,
  • scan the original wordlist forms as uncompressed TIFF images, and
  • enter the transcriptions, glosses, and time alignments into an XML file using the Unicode character encoding.

In order to meet the second objective of ensuring the on-going availability of the materials far in the future beyond the life of the current hardware and media, our plan was to:

  • deposit the complete set of archival files with the SIL Language and Culture Archive, where they will be migrated to fresh media as needed, and
  • describe the resource with standard metadata so that it remains discoverable.

But long-term access is not our only objective. We also want the materials to be available today in an easy-to-access form on the Internet. In order to meet this objective, we also set out to:

  • transform the XML wordlist into an HTML page for viewing in a Web browser,
  • link each transcribed response to a small WAV file of its pronunciation,
  • provide GIF versions of the scanned transcription images, and
  • publish the resource description via an OLAC data provider.

3. Best Practice Guidelines

In this project, we were guided by best practice recommendations for digitizing the audio recordings (Plichta and Kornbluh 2002), digitizing the images of the transcription (MATRIX 2001), and for creating digital language documentation and description in general (Bird and Simons 2003). The following tables summarize relevant aspects of these recommendations and indicate the degree to which this project was able to adhere to these recommendations.

Figure 1 addresses guidelines for digitizing audio recordings. Plichta and Kornbluh (2002) recommend 96,000 Hz, 24-bit as a standard for archive-quality digital audio, but also note that 44,100 Hz, 16-bit is adequate for technical purposes. We used the latter standard as our digitizing equipment did not support the higher sampling rate and bit depth.

Figure 1: Recommendations regarding digital audio (Plichta and Kornbluh 2002)

  • Recommended for archival purposes [2]: sample rate 96,000 Hz; bit depth 24-bit.
    Project: Lack of appropriate hardware prevented us from following this recommendation.

  • Sufficient for technical purposes [3]: sample rate 44,100 Hz; bit depth 16-bit.
    Project: This is the standard that we followed.

  • Oversampling delta-sigma A/D converter with dither added prior to sampling.
    Project: Lack of appropriate hardware prevented us from following this recommendation.

  • WAV file format [4].
    Project: This is the format that we used.
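The digitization parameters in figure 1 can be verified programmatically once a file has been captured. The following Python sketch (illustrative only; the file name is hypothetical, and any audio tool reports the same header values) writes and inspects a WAV file with the 44,100 Hz, 16-bit, mono settings we used:

```python
import wave

# Write a short mono WAV file using the project's digitization settings:
# 44,100 Hz sample rate, 16-bit samples (2 bytes), one channel.
with wave.open("saliba_check.wav", "wb") as w:
    w.setnchannels(1)       # one-track recording (right channel only)
    w.setsampwidth(2)       # 16-bit = 2 bytes per sample
    w.setframerate(44100)   # 44,100 Hz sampling rate
    w.writeframes(b"\x00\x00" * 44100)  # one second of silence

# Read the header back and confirm the chosen parameters.
with wave.open("saliba_check.wav", "rb") as r:
    assert r.getframerate() == 44100
    assert r.getsampwidth() == 2
    assert r.getnchannels() == 1
    seconds = r.getnframes() / r.getframerate()
print(f"{seconds:.1f} s at 44100 Hz, 16-bit")
```

A check like this is a cheap safeguard against digitizing an entire collection at the wrong settings.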

Figure 2 addresses guidelines for digitizing images of textual materials. MATRIX (2001) distinguishes between recommendations for master images and for access images. We followed the former recommendations for generating the archival form and the latter for the presentation form.

Figure 2: Recommendations regarding digital images (MATRIX 2001)

Master images:

  • Bit depth: 8-bit grayscale or 24-bit color.
    Project: 8-bit grayscale.

  • Scanning resolution: 300 dpi for original documents smaller than 11” x 17”, 200 dpi for larger documents.
    Project: 300 dpi.

  • Image size: size of original document at scan resolution.
    Project: original size of 8.5” x 11” is preserved.

  • Format: uncompressed TIFF [5].
    Project: uncompressed TIFF.

Access images:

  • Bit depth: 8-bit grayscale or 24-bit color.
    Project: 8-bit grayscale.

  • Scanning resolution: 72-90 dpi depending on character height.
    Project: 72 dpi.

  • Image size: original size, at 72-90 dpi.
    Project: scanned image is 75% of the original (an artifact of the resampling process).

  • Format: for documents smaller than 8.5” x 14”, 4-bit interlaced GIF for 8-bit grayscale images or 8-bit interlaced GIF for 24-bit color images; for larger documents, 8-bit grayscale JPEG for grayscale images or 24-bit color JPEG (RGB mode) for color images.
    Project: 8-bit interlaced GIF.
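The pixel dimensions implied by the figure 2 recommendations follow from simple arithmetic: page dimensions in inches times the scanning resolution. A Python sketch of the calculation (illustrative only):

```python
def pixel_size(width_in, height_in, dpi):
    """Pixel dimensions of a page scanned at a given resolution."""
    return round(width_in * dpi), round(height_in * dpi)

# Master image: an 8.5" x 11" wordlist page at 300 dpi.
master = pixel_size(8.5, 11, 300)
# Access image: the same page at 72 dpi.
access = pixel_size(8.5, 11, 72)

print(master)  # (2550, 3300)
print(access)  # (612, 792)
```

The master image is thus roughly seventeen times the pixel count of the access image, which is why only the latter is practical for Web delivery.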

Figure 3 addresses the problem of “portability” of digital language resources. This includes the synchronic portability of resources across a multiplicity of present-day computing platforms as well as the diachronic portability of today’s resources to the computing platforms of the future. Bird and Simons (2003) identify seven dimensions of portability and propose best practice guidelines designed to maximize the ability of digital language resources to move across different computing platforms and to remain usable far into the future. Two of the dimensions, citation and preservation, are mostly relevant to the institutions that publish and archive resources. The other five, however, are particularly relevant to the linguists who create language resources. The following table repeats some of the key recommendations in these five areas and describes how the current project has responded.

Figure 3: Recommendations regarding portability of language documentation and description (Bird and Simons 2003)

Content:

  • Recommendation: When texts are transcribed, provide the primary recording (without segmenting it into clips).
    Project: The full elicitation is provided in a 42-minute WAV file.

  • Recommendation: Transcriptions should be time-aligned to the underlying recording in order to facilitate verification.
    Project: Each response is time-aligned.

Format:

  • Recommendation: Use open formats supported by multiple software vendors.
    Project: The formats used for archival forms are open: XML (for transcription), WAV (for audio), TIFF (for images).

  • Recommendation: Use Unicode for character encoding.
    Project: IPA transcriptions are encoded in Unicode.

  • Recommendation: Use a descriptive markup system (preferably XML) for textual information.
    Project: The whole wordlist (including glosses, transcriptions, and time alignments) is represented in an XML file with descriptive markup tags.

  • Recommendation: Provide a human-readable version of the same information in a suitable presentation format.
    Project: The XML wordlist is transformed to HTML for presentation in a Web browser.

Discovery:

  • Recommendation: Describe the resource using the metadata standard of the Open Language Archives Community.
    Project: An OLAC-conformant resource description is supplied.

  • Recommendation: Make the resource known to the world at large by publishing the metadata description with an OLAC data provider.
    Project: The metadata are published via the OLAC data provider for SIL's Language and Culture Archive [note].

Access:

  • Recommendation: Make the resource accessible to all interested users.
    Project: The presentation form will be published on the Web.

  • Recommendation: Publish in such a way that users can access the original materials to manipulate them in novel ways.
    Project: The archival form of the full resource is available by ordering a CD-ROM.

Rights:

  • Recommendation: Make a clear statement of terms of use so that users know what they may do with the material.
    Project: The resource description states that the materials are copyrighted and available to all under standard terms of Fair Use.

  • Recommendation: Identify and protect any sensitivities inherent in the material.
    Project: There are no known sensitivities, and this is stated in the resource metadata.

4. Results and process

Following Simons (2004), we distinguished three forms of the data in this project:

Working form

The form in which information is stored as it is created and edited.

Archival form

The form in which information is stored for access long into the future.

Presentation form

The form in which information is delivered for presentation to the public today.

In the following subsections we first show the results that were achieved for the archival and presentation forms, and then describe the working forms and processes we used to create them.

4.1 The results

It is helpful to begin by showing the final results; understanding the end point should make it easier to follow the process used to get there. Figure 4 gives links to the results. First click on the link under “Presentation form” to see the form that has been developed for publishing the Sáliba wordlist as an interactive web page using today's presentation technologies. On the Sáliba wordlist page, click on the links at the top of the page to see the resource description (or metadata) and the images of the original field transcriptions. Then click on a loudspeaker icon to hear the pronunciation of the word as recorded in the field. (The icon is linked to a WAV file; your web browser will attempt to play it with whatever program is set up as the default WAV player on your system.)

Figure 4: The final results

Presentation form

Archival form

After exploring all the features of the presentation form, click on the two links under “Archival form” to see the XML form of the metadata and transcribed wordlist from which the presentation forms were generated. The other two elements of the archival form (the complete recording and the images of the field transcriptions) are too large to make available via the web medium. The complete set of results is available on a CD-ROM from the SIL Language and Culture Archives [note with ordering information].

4.2 The working forms

While the archival and presentation forms were the real aim of the project, the first step in the process was to digitize the materials into a form that could be worked with to produce the other two forms. In the case of the audio and the images, the working form and the archival form were essentially the same. For the audio, the ??? program was used to convert the analog audio signal from the cassette recorder into a digital data stream matching the sampling parameters detailed above in figure 1. Similarly, for the images, the ??? program was used to scan the original wordlist forms from the field into digital images matching the parameters detailed above in figure 2.

In the case of digitizing the transcriptions and aligning them with the recording, a number of tools and transient working forms were employed. Figure 5 lists all the steps in the process that was followed. For each step there is a link to the resulting output and to any script that was used in that step. (Note that none of the files referenced in figure 5 are in a binary format; all are viewable as plain text.)

Figure 5: Processing steps with scripts and results

Process

Tool

Script

Result

Capturing the information in working forms

1. Enter information on the wordlist form into a plain text file using Standard Format tags and Unicode.
   Tool: Microsoft Word 2000. Result: Sáliba_wordlist.txt

2. Transform the Standard Format file to an XML file.
   Tool: CC. Script: sf_to_xml.cct. Result: wordlist_temp.xml

3. Transform the XML file to a CSV file.
   Tool: XSLT. Scripts: xml_to_csv.xsl, convert_to_csv.bat. Result: wordlist_temp.csv

4. Add approximate start and end times (for aligning with the recording) to the CSV file.
   Tool: Microsoft Excel 2000. Result: wordlist_temp2.csv

5. Align transcriptions with the recording.
   Tool: TableTrans. Result: Sáliba_ag.xml

Generating the archival form

6. Transform the annotation graph to the archival DTD.
   Tool: XSLT. Result: SLC_wordlist.xml

Generating the presentation form

7. Transform the archival form to an HTML page.
   Tool: XSLT. Result: SLC_wordlist.htm

TableTrans is a data entry tool; thus one could well ask, “Why didn't you just start at step 5?” The basic reason is that although TableTrans supports Unicode characters, it lacked a convenient way to enter them. Microsoft Excel 2000 had the same limitation; otherwise we could have started at step 4. [6] Thus we started with Microsoft Word 2000, which did have a convenient way to create the IPA transcriptions in Unicode. In the remainder of this subsection, we describe in detail the steps we took to create the fully annotated and time-aligned working form of the wordlist data. Such a detailed description of the working forms would not normally be part of a published data set (which would focus on the archival and presentation forms). We include the details here as an aid to others who contemplate carrying out a similar project.

1. Enter information into a plain text file. The first step in digitizing the wordlist and its transcription was to use a word processor to type all the information on the wordlist form into a plain text file. Any number of editors would suffice for this purpose; we used Microsoft Word 2000 since its Insert / Symbol command provided an easy way to enter Unicode characters by selecting one from a chart of all available characters. [7]

Figure 6 gives a sample from the file produced with Word. It uses the so-called “Standard Format” convention to encode the structure of the information. In this convention, tags beginning with the backslash character indicate the function of the information that follows. Note that in figure 6, each line begins with a tag and holds one information element. The first five lines encode information about the elicitation from the top of the wordlist form. The remainder of the file encodes each wordlist item in turn. Four basic tags are used for each item: \num for the number of the item, \spn for the Spanish gloss, \eng for the English gloss, and \slc for the Sáliba response (SLC being the three-letter identifier for Sáliba in the Ethnologue (Grimes 2000)). The tags for glosses or responses are repeated when there is more than one, as in number 16. Number 26 illustrates how the information was entered when no Sáliba word was elicited for a given item. Number 67 illustrates an additional tag (namely, \not for “note”) used to encode notes that the linguist made concerning particular elicited forms. Finally, an \end tag marks the end of the wordlist.

Figure 6: Sample of plain text file for capturing wordlist data
\language Sáliba
\investigator Nancy L. Morse
\date 12-15 Febrero 1996
\speaker Angel Eduardo Humeje [Sáliba utterances]
\speaker Saúl Humeje [Spanish prompts]

\num 1
\spn lengua
\eng tongue
\slc a´ne¯~ne~¯

\num 2
\spn boca
\eng mouth
\slc a¯na~´
...

\num 16
\spn vientre
\spn abdomen
\eng abdomen
\slc te?et?e
\slc te?et?e ampuhu
...

\num 26
\spn muslo
\eng thigh
\slc [no word]
...

\num 67
\spn día
\eng day
\slc d?e?we
\not opuesto de noche
\slc ?ukudia~
\not día de la semana; 24 horas
...

\end
N.B. Need to clean up the Unicode characters above. They've turned into ? somewhere along the line.

Note opposite order of diacritics in form 1.

The file must not contain any commas, as TableTrans interprets them as field separators in step 5 of the process (even though they are properly quoted in the CSV format). It was therefore necessary to do a global search for commas before saving the file and to reword any glosses that contained them. To save a file from Word 2000 as a Unicode-encoded plain text file, use the File / Save As command: for “Save as type”, choose “Encoded Text”; in the dialog that follows, choose “Unicode (UTF-8)”; then click OK to save the file.
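This comma check can also be scripted. The Python sketch below is illustrative only (our project simply used Word's global search, and the sample gloss with a comma is hypothetical):

```python
# Report any lines of the Standard Format file that contain a comma,
# since TableTrans would misread commas as field separators in step 5.
def find_commas(lines):
    return [(n, line) for n, line in enumerate(lines, start=1)
            if "," in line]

sample = [
    "\\num 16",
    "\\spn vientre",
    "\\eng abdomen, belly",  # hypothetical gloss that would break step 5
    "\\slc te?et?e",
]
for n, line in find_commas(sample):
    print(f"line {n}: {line}")
```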

Morse used an Americanist phonetic alphabet for the original transcriptions. The target archival form was to encode the phonetic transcriptions with the IPA block of Unicode; thus it was necessary to convert a number of characters to IPA during data entry. Figure 7 gives a chart showing the changes that were made to the original transcription:

Figure 7: Conversion chart for encoding transcriptions

Original Transcription

IPA Character Used

b

ß

ñ

r

p

x

r

~

r

r

š

ƒ

y

j

ž

Do all the characters in the chart look right? N.B. All the characters in the chart have a FONT tag for the SIL IPA font. We need to give this some thought. We could remove the tags and simply depend on choosing a Unicode font in the browser; or we could leave them in (and use them elsewhere as well), give instructions somewhere for installing the font, and explain how to set up the default browser font as a backup.

There was one feature of the original transcription that it was not possible to recreate in digital form. A number of words had a ligature joining vowels that Morse considered to be a diphthong, for example, the ai in haixo”di “chief”. We did not find a straightforward way to capture this ligature using standard Unicode characters.

Is this really right? Isn't there a ligature character in IPA, or is it just not in the Unicode set yet?

In the process of checking the transcription against the audio data, it became evident that there is one particular shortcoming in the original transcription. Stops in Sáliba are prenasalized when following nasalized vowels (Morse and Frank 1996:30), but the transcription does not indicate any prenasalization. See (and listen to), for example, items 13 and 38 in the wordlist. Since the focus of this project is capturing the existing language documentation in digital form, no changes were made to the transcription in the process.

2. Transform to an XML file. Our next objective was to convert the Standard Format file to a comma-separated value (CSV) file that could be loaded into Excel and TableTrans. Given the tools we knew how to use, we found the easiest way to accomplish this was to use a simple stream editor to convert the Standard Format to XML markup and then use an XSLT script (in step 3) to transform the XML into the needed form.

CC, for Consistent Changes, is the stream editor we used (SIL 2002). The script (given in figure 5) simply puts a matching start- and end-tag around each field of information. In the process it transforms the text string representing missing data (“[no word]”) to an empty-element tag (<noform/>) and inserts start- and end-tags to encode the hierarchical structure of the information: <wordlist> and </wordlist> to identify the whole document as a wordlist, <item> and </item> to group all the information related to the same wordlist item, and <response> and </response> to group a form and its optional related note. Figure 8 shows the result for three sample items.

Figure 8: Sample of XML transformation of original wordlist data
<wordlist>

<item>
   <number>1</number>
   <spanish>lengua</spanish>
   <english>tongue</english>
   <response><form>a´ne¯~ne~¯</form></response>
</item>
...

<item>
   <number>26</number>
   <spanish>muslo</spanish>
   <english>thigh</english>
   <response><noform/></response>
</item>
...

<item>
   <number>67</number>
   <spanish>día</spanish>
   <english>day</english>
   <response>
      <form>d?e?we</form>
      <note>opuesto de noche</note>
   </response>
   <response>
      <form>?ukudia~</form>
      <note>día de la semana; 24 horas</note>
   </response>
</item>
...

</wordlist>
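The CC script itself is given in figure 5. For readers who do not use CC, its behavior can be sketched in Python; this is an illustrative reimplementation, not the script the project used:

```python
# Illustrative Python equivalent of the CC script sf_to_xml.cct:
# wrap each Standard Format field in matching XML tags, group the
# fields of one item, and turn "[no word]" into <noform/>.
# (Glosses are assumed to be free of XML metacharacters.)
TAGS = {"num": "number", "spn": "spanish", "eng": "english"}

def sf_to_xml(lines):
    out = ["<wordlist>"]
    item = []  # XML lines accumulated for the current item

    def flush():
        if item:
            out.append("<item>")
            out.extend(item)
            out.append("</item>")
            item.clear()

    for line in lines:
        if not line.startswith("\\"):
            continue
        tag, _, value = line[1:].partition(" ")
        if tag == "num":
            flush()  # a new \num line starts the next item
        if tag in TAGS:
            item.append(f"   <{TAGS[tag]}>{value}</{TAGS[tag]}>")
        elif tag == "slc":
            if value == "[no word]":
                item.append("   <response><noform/></response>")
            else:
                item.append(f"   <response><form>{value}</form></response>")
        elif tag == "not":
            # a \not line attaches to the immediately preceding response
            item[-1] = item[-1].replace(
                "</response>", f"<note>{value}</note></response>")
        elif tag == "end":
            flush()
            out.append("</wordlist>")
    return "\n".join(out)

sample = ["\\num 26", "\\spn muslo", "\\eng thigh",
          "\\slc [no word]", "\\end"]
print(sf_to_xml(sample))
```

Run on the records of figure 6, this yields item structures like those in figure 8.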

3. Transform to a CSV file. With the original data in an XML file, the easiest way to convert it to a comma-separated value (CSV) format is to use XSLT in its text output mode. XSLT, for Extensible Stylesheet Language Transformations, is a language for transforming XML documents into other XML documents or into plain text documents. It is an open standard defined by the World Wide Web Consortium and is supported by dozens of vendors (Thompson 2004). We used the XSLT processor that is built into Internet Explorer, which can be accessed in a batch file through a command-line interface downloadable from Microsoft (Kimbell 2001). The batch file (which performs steps 2 and 3) is given in figure 5.

The transformation (see figure 5) creates a table with seven columns. A row is created for each response in the wordlist. The first two columns are for the start and stop time of the pronunciation of the word within the recording; these are set to 0 at this point. The third column holds the number of the item for the first response associated with a wordlist item, and holds a “+” character for an additional response for the same item. The fourth and fifth columns hold the Spanish and English glosses, respectively. When there are multiple glosses, the script puts them in the same table cell separated by semicolons. The glosses are left blank in the “+” rows. Column six holds the IPA transcription, or “[none]” when no form was elicited. Finally, the note (if present) is in column seven. See figure 9 for a sample of the CSV format.

Figure 9: Rows of comma-separated value file corresponding to figure 8
0,0,1,"lengua","tongue","a´ne¯~ne~¯",
0,0,26,"muslo","thigh","[none]",
0,0,67,"día","day","d?e?we","opuesto de noche"
0,0,+,,,"?ukudia~","día de la semana; 24 horas"
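The XSLT script performs this flattening; its row logic can be sketched in Python as follows (an illustrative reimplementation, not the project's xml_to_csv.xsl, with the wordlist data given directly as Python structures for brevity):

```python
# One CSV row per response; glosses appear only on the first row of an
# item, and "+" marks additional responses for the same item.
def quote(s):
    return f'"{s}"' if s else ""

def item_rows(item):
    rows = []
    for i, (form, note) in enumerate(item["responses"]):
        number = item["number"] if i == 0 else "+"
        spanish = quote("; ".join(item["spanish"])) if i == 0 else ""
        english = quote("; ".join(item["english"])) if i == 0 else ""
        rows.append(
            f"0,0,{number},{spanish},{english},{quote(form)},{quote(note)}")
    return rows

item67 = {"number": "67", "spanish": ["día"], "english": ["day"],
          "responses": [("d?e?we", "opuesto de noche"),
                        ("?ukudia~", "día de la semana; 24 horas")]}
for row in item_rows(item67):
    print(row)
```

The two printed rows match the last two rows of figure 9, including the “+” marker and the blank gloss fields of the second row.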

4. Add approximate start and end times. If a CSV file like the above is loaded into TableTrans, the program jumps the display of the recorded waveform back to the beginning (that is, time 0) every time a row of the table is selected. We found the table much easier to work with when it was preloaded with the approximate start and end times of the responses. This was done by loading the CSV file into a spreadsheet program, Microsoft Excel 2000, and using it to fill in the series of values automatically.

In the Sáliba recording, the first response occurs at second 23. Dividing the total length of the recording by the number of responses yields an average spacing of 5.6 seconds between responses. Thus, in the first column of the spreadsheet, we entered 23 in the first row and 28.6 in the second row. Once the first two values are entered, click on the first cell and drag to the second to make a selection; then drag the “fill handle” that appears in the bottom right corner of the selection down to the bottom of the table. This fills the column with an arithmetic series in which each entry exceeds the one above it by the difference between the first two values. A response in the Sáliba recording was typically one second long, so the same was done in the second column beginning with the values 24 and 29.6.
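The fill handle is just doing arithmetic, so the same preload could be computed directly. A Python sketch (illustrative; the values are those used for our recording):

```python
# Preload approximate (start, end) times for each response, mimicking
# the arithmetic series Excel's fill handle produces: first response
# at second 23, responses 5.6 seconds apart, each about 1 second long.
def preload_times(n_rows, first_start=23.0, spacing=5.6, length=1.0):
    return [(round(first_start + i * spacing, 1),
             round(first_start + i * spacing + length, 1))
            for i in range(n_rows)]

for start, end in preload_times(3):
    print(start, end)
```

This prints 23.0 24.0, then 28.6 29.6, then 34.2 35.2, matching the spreadsheet series described above.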

Before saving the result, it is necessary to inspect all the forms in the sixth column, since some of them may have been mistakenly interpreted by Excel as formulas. For instance, in the Sáliba wordlist the responses for items 320 to 323 are suffixes and are transcribed with an initial hyphen. Excel interprets the hyphen as introducing a formula and complains of a bad variable name. The fix is to click on the cell and change the “=” at the beginning of the formula bar to a single quote (apostrophe), which causes the string to be treated as literal text. Once these changes are made and verified, use the File / Save command to save the file back out in CSV format.

5. Perform time alignment with TableTrans. Aligning the transcription with the audio recording was one of the more challenging aspects of the project. It would have been possible, but laborious, to find each utterance and note its start and stop times in the master audio file using conventional audio software. The TableTrans [8] program from the Linguistic Data Consortium (Bird and others 2002) provided a better alternative for this task. It allows the linguist to follow a “play cursor” through a graphic representation of the waveform as the recording is played back. The linguist may stop the playback at any point, use the mouse to select a region of the wave, and then enter annotations about that particular bit of the recording into a table with user-defined columns. In our application, we defined columns for five annotations: the item number, the Spanish prompt, the English translation, the IPA transcription of the Sáliba utterance, and the additional notes that were entered on the original wordlist forms. In addition to this convenient method for performing the time alignment, TableTrans has other features that were critical for this project, namely the ability to import and export a variety of file formats (including CSV and XML) and the ability to generate individual sound files for each of the annotated segments. We explored a number of sound annotation tools, and TableTrans was the only one we found that combined all the needed features.

After starting TableTrans, use File / Open Sound File to load the sound file, and then File / Open Annotation File (in Table Format) to load the CSV file from step 4. When prompted, select UTF-8 as the file encoding and enter the feature list as “number spanish english form note”. You are now set to go through the whole recording, aligning the sound with the transcriptions. To align a transcription, click on its row of annotations, drag across the sound wave to select the corresponding region, click the play control to verify that the region is correctly identified, and then hit Ctrl-G to set the alignment. Use File / Save to save your work periodically. Once everything is aligned, save the work in XML format by using the File / Save Annotation As command and selecting the “in AG XML” format. Name the file Sáliba_ag.xml; the result cell for step 5 in figure 5 shows the file that results. (Note that the file has been edited to comment out the DOCTYPE declaration; the Microsoft XML parser would not validate the document because of the colons in ID values.)

4.3 Generating the archival forms

In the case of the audio and the images, producing the archival form was a matter of saving the digitized information into a file of the format selected for this purpose (see figures 1 and 2). Minimal editing was done to crop out the long periods of silence at the beginning and end of the recording and to trim excessive white space on the edges of the scanned images. However, no editing was done that would change the data stream (such as hiss reduction on the audio or contrast enhancement on the images). While a researcher is likely to perform such enhancements when using the data for a particular application, they are not considered best practice for the archival form: other researchers may need a different enhancement to suit their research needs, and future software may be able to do an even better job. Thus the best archival form is the highest-quality digitization that can be achieved without employing any software enhancements.

Following the various best practice recommendations, the archival forms of the digital files are in open formats: XML for the textual data (using the UTF-8 encoding of the Unicode character set), WAV for the audio data, and uncompressed TIFF for the graphic images of the original written documents. Open formats have greater longevity and can be transformed into presentation formats that are more “reader friendly.”

The XML file is derived from the XML output of the TableTrans software, which was used to organize the Spanish and English glosses, the IPA transcriptions of the Sáliba responses, and the time alignments to the WAV file. The following is a sample of entries 8-10 in the XML file, showing the item number, the Spanish and English glosses, the start and stop times for each Sáliba utterance in the master sound file, and the transcription of the utterance. Where there are two alternate Sáliba words or pronunciations for a given prompt, these are recorded in separate “response” blocks within a single “item”, as seen in items 9 and 10.

<item n="8">
   <gloss xml:lang="es">cabeza</gloss>
   <gloss xml:lang="en">head</gloss>
   <response>
      <audio start="58.975000" end="59.775000" />
      <form>id’u</form>
   </response>
</item>
<item n="9">
   <gloss xml:lang="es">frente</gloss>
   <gloss xml:lang="en">forehead</gloss> 
   <response> 
      <audio start="64.250000" end="65.175000" /> 
      <form>pa”e</form>
   </response> 
   <response> 
      <audio start="66.450000" end="67.250000" /> 
      <form>pae</form>
   </response> 
</item> 
<item n="10">
   <gloss xml:lang="es">cabello</gloss> 
   <gloss xml:lang="en">hair</gloss>
   <response> 
      <audio start="71.825000" end="73.000000" /> 
      <form>huwoiwo</form>
   </response> 
   <response> 
      <audio start="73.750000" end="74.775000" />
      <form>huwoiwa</form>
   </response> 
</item> 

The main XML file containing the data is accompanied by a second XML file that documents the metadata for this language resource, following the OLAC standards.
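The shape of such an OLAC record is roughly as follows. This is an illustrative sketch only: the namespace version, the choice of elements, and the representation of the language code are assumptions, and the actual metadata file accompanying the resource is authoritative.

```xml
<olac:olac xmlns:olac="http://www.language-archives.org/OLAC/1.0/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Sáliba wordlist</dc:title>
  <dc:creator>Morse, Nancy L.</dc:creator>
  <dc:date>1996</dc:date>
  <dc:subject.language code="x-sil-SLC"/>
  <dc:format>text/xml</dc:format>
</olac:olac>
```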

The XML file contains the information from the wordlist as a structured plain text file. It uses the UTF-8 encoding of the Unicode character set. Any software capable of displaying UTF-8 data should be able to render the phonetic data faithfully, given a Unicode font that includes the IPA block of characters. We used a Unicode version of SIL's IPA font in this project, as it conforms to our specifications; that font should be generally available in the near future. One problem, however, is that the original transcription calls for stacked diacritics: an acute accent or macron (used to represent pitch) over a tilde (used to indicate nasalization). The diacritics are not always displayed properly; depending on the particular font and viewing software, they may be superimposed rather than stacked. Despite this problem, we still consider it best to opt for Unicode encoding of the IPA characters rather than a custom font, because the underlying data is then preserved for long-term storage following a standard that is true to the original transcription, even if the rendering on some systems is defective.
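The stacking issue can be seen directly in the encoding. In Unicode, the nasalization and pitch marks are separate combining characters, and since both are “above” diacritics (combining class 230), their order in the data is significant and is preserved by normalization; a conformant renderer stacks them outward from the base letter. A short Python illustration:

```python
import unicodedata

# Nasalized "a" with acute pitch accent: base letter followed by
# combining tilde (U+0303) and combining acute (U+0301).
nasal_acute = "a\u0303\u0301"

# Both marks have combining class 230, so canonical ordering does not
# reorder them: the typed order is the rendered stacking order.
assert unicodedata.combining("\u0303") == 230
assert unicodedata.combining("\u0301") == 230

# NFC composes a + tilde into precomposed ã, but there is no single
# codepoint for a-tilde-acute, so the acute remains a combining mark.
nfc = unicodedata.normalize("NFC", nasal_acute)
print([unicodedata.name(c) for c in nfc])
```

Whether the two marks then appear stacked or superimposed is purely a font and rendering-engine matter; the underlying data is unambiguous.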

The master WAV file in the archival form of this resource is a straight digitization of the audio cassette. There is a second file in which the tape hiss has been removed, but that file is not part of the set of archival materials.
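A quick way to confirm that a digitized master meets the archival specification adopted here (44,100 Hz sampling with 16-bit samples, per Plichta and Kornbluh's recommendation in note 3) is to inspect the file header. The sketch below is illustrative, not part of the project workflow; it writes a tiny stand-in file so it is self-contained, where in practice the path would point at the digitized master of the cassette.

```python
import struct
import wave

def matches_archival_spec(path):
    """True if the WAV file has 44,100 Hz sampling and 16-bit samples."""
    with wave.open(path, "rb") as w:
        return w.getframerate() == 44100 and w.getsampwidth() == 2

# Stand-in master file (hypothetical name) so the sketch runs as-is.
with wave.open("saliba_master_demo.wav", "wb") as w:
    w.setnchannels(1)       # mono, for this sketch
    w.setsampwidth(2)       # 16-bit samples
    w.setframerate(44100)   # CD-quality sampling rate
    w.writeframes(struct.pack("<h", 0) * 441)  # 10 ms of silence

print(matches_archival_spec("saliba_master_demo.wav"))
```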

4.4 Generating the presentation forms

Additionally, we used TableTrans to create individual sound files corresponding to each of the segments identified in the transcription process. These are used in the Web-based presentation form of the data so that playing back a single utterance involves downloading only a small WAV file.

Clearly, a presentation form of this data set is also needed, and we have chosen to prepare these data for Web presentation. For this purpose, an XSLT stylesheet renders the XML file as HTML; small WAV files for each of the utterances are linked to the data for playback; and GIF versions of the scanned images of the original transcription are provided, since they are much smaller than the archival TIFF images. The following is a sample of the HTML presentation form of the first 14 words in the wordlist:


Sáliba wordlist

See complete resource description.
See original field transcriptions.

 1. lengua (tongue): anene
 2. boca (mouth): aha
 3. labio (lip): axexe
 4. diente (tooth): o”we”e, owe”e
 5. nariz (nose): ixu
 6. ojo (eye): paxute
 7. oreja (ear): axoxo
 8. cabeza (head): id’u
 9. frente (forehead): pa”e, pae
10. cabello (hair): huwoiwo, huwoiwa
11. mentón (chin): ahatƒu
12. barba (beard): ahixe
13. cuello (neck): okwa
14. pecho (chest): omixe, omexe

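The transformation from the archival XML to this presentation form was done with an XSLT stylesheet. As a rough sketch of what that transformation does, the Python below reads `<item>` records like those shown in section 4.3 and emits one HTML list entry per item; the element names are those of the archival file, but the function and its output format are illustrative only.

```python
import xml.etree.ElementTree as ET

# A one-item fragment in the structure of the archival XML file.
SAMPLE = """<wordlist>
  <item n="9">
    <gloss xml:lang="es">frente</gloss>
    <gloss xml:lang="en">forehead</gloss>
    <response><audio start="64.250000" end="65.175000"/><form>pa”e</form></response>
    <response><audio start="66.450000" end="67.250000"/><form>pae</form></response>
  </item>
</wordlist>"""

def items_to_html(xml_text):
    """Render each <item> as an HTML list entry: number, both glosses,
    and every recorded response form."""
    root = ET.fromstring(xml_text)
    rows = []
    for item in root.iter("item"):
        es, en = (g.text for g in item.findall("gloss"))
        forms = [r.findtext("form") for r in item.findall("response")]
        rows.append("<li>%s. %s / %s: %s</li>"
                    % (item.get("n"), es, en, ", ".join(forms)))
    return "<ol>\n%s\n</ol>" % "\n".join(rows)

print(items_to_html(SAMPLE))
```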
5. Cost analysis

One obstacle to following best practice is that it is more costly than following convenient practice. However, in this world of short-lived storage media and rapid technology change, the path of convenience leads directly to untimely extinction of the results of our field work. Thus, against the cost of following best practice, we must balance the cost of losing the information and all that was invested in collecting it and preparing it for dissemination. In the hopes of providing information that might help others plan future language documentation projects, we offer the following analysis of the costs of this project.

Cost of collecting the materials:

Cost of preparing the materials for archiving:

Cost of long-term preservation:

[Discuss costs of CD-ROM and hard disk storage.]

6. Conclusion

One success of this project was preparing the same data in both archival and presentation formats. The presentation format of the textual data can be generated automatically from the archival form, thus avoiding the need to maintain two distinct data sets. The archival data is primary: without care in the preparation of the archival form of the data, there is a high likelihood that the information will become unusable within just a few years as technology changes. With the current approach it is possible to have multiple scripts for generating multiple presentation formats. While the archival format stays constant over time, future generations can generate new presentation formats that take advantage of advances in presentation technology.

Transcriptions and recordings exist for a number of other Colombian languages, gathered with the same elicitation instrument that was used for these Sáliba data. The TableTrans files prepared in this project could serve as templates for preparing similar digital data sets for those languages: the transcriptions for a given language can be entered, the appropriate sound file associated with them, and the time-alignment done. Other tools developed for this project could easily be adapted to facilitate the preparation of presentation forms of wordlists for these other languages.

In addition, the general principles underlying the development of the digital version of the Sáliba wordlist could be applied to other types of language documentation, especially the distinction between archival and presentation formats, the use of XML and Unicode for the textual data, and the time-alignment of the audio information and textual information.

Notes

1. See Morse and Frank (1996) and Estrada R. (1996) for general information about Sáliba.

2. “Current technology makes it possible to use higher sampling rates and resolution rather inexpensively. One could fairly easily sample at 96,000 Hz and a 24-bit resolution. This would result in a much increased frequency response of the digital waveform—from 0 to 48,000 Hz, and a dramatically improved SNR of 144 dB. … Given the technological potential we have at our disposal, the choice of digitization standards appears to be simple: use 96,000 Hz sampling rate and a 24-bit quantization” (Plichta and Kornbluh 2002:6). This recommendation is made in light of the increased use of recordings that have been mastered in digital form and thus have a higher bandwidth than older, analog recordings such as the ones we are working with in this project.

3. “In the technical sense, we need to establish a process that, minimally, reconstructs the entire frequency response of the original while adding as little of the so-called ‘digital noise’ as possible. To achieve this goal, it seems to be sufficient to use the 44,100 Hz sampling rate with a 16-bit resolution” (Plichta and Kornbluh 2002:6).

4. “WAV files are uncompressed, thus preserving all bits of information recorded in the AD process. It is also widely used and easy to process and convert to a variety of streaming formats” (Plichta and Kornbluh 2002:7).

5. “Archival or Master Files will require the lossless TIFF format. While not supported by the web, TIFF is a widely supported format for storing bit-mapped images on personal computer hard-drives. It is also the common format for exchanging images between application programs” (MATRIX 2001:10).

6. The latest version of the program, Microsoft Excel 2003, does have an Insert / Symbol command that makes it easy to select and insert any Unicode character.

7. The process described in this step does not work with the latest version of the program, Microsoft Word 2003, since its File / Save As dialog no longer offers the “Encoded text” choice for saving as a UTF-8 encoded file. Fortunately, Excel 2003 now supports the Insert / Symbol command so that it is possible to enter the transcriptions directly into Excel and skip steps 2 and 3.

8. We are grateful to Kazuaki Maeda, part of the TableTrans development team, for prompt programming assistance with bugs which we encountered with the program. Without his help we would not have been able to successfully use TableTrans for this project.

References

Bird, Steven, Kazuaki Maeda, Xiaoyi Ma, Haejoong Lee, Beth Randall, and Salim Zayat. 2002. TableTrans, MultiTrans, InterTrans and TreeTrans: Diverse tools built on the Annotation Graph Toolkit. Proceedings of the Third International Conference on Language Resources and Evaluation, Paris: European Language Resources Association, 2002. Online: http://arxiv.org/abs/cs/0204006. TableTrans and the other related tools can be downloaded by selecting “AGTK Windows” from http://agtk.sourceforge.net/.

Bird, Steven, and Gary Simons. 2003. Seven dimensions of portability for language documentation and description. Language 79(3):557-582. Preprint available at: http://www.ldc.upenn.edu/sb/home/papers/0204020/0204020-revised.pdf

Estrada R., Hortensia. 1996. La lengua sáliba: Clases nominales y sistema de concordancia. Santafé de Bogotá: Colcultura.

Grimes, Barbara F., ed. 2000. Ethnologue: Languages of the world, 14th edition. Dallas: SIL International. Also available online at: http://www.ethnologue.com/

Kimball, Andrew. 2001. Command line transformations using msxsl.exe. Microsoft Corporation. Online: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnxml/html/msxsl.asp

MATRIX. 2001. Digital imaging for archival preservation and online presentation: Best practices. Michigan State University, MATRIX: The Center for Humane Arts, Letters and Social Sciences Online . Online: http://www.historicalvoices.org/papers/image_digitization2.pdf

Morse, Nancy L., and Paul S. Frank. 1997. Lo más importante es vivir en paz: Los sálibas de los Llanos Orientales de Colombia. Santafé de Bogotá: Editorial Alberto Lleras Camargo.

[The two references in the text say 1996. Which is right?]

Plichta, Bartek, and Mark Kornbluh. 2002. Digitizing speech recordings for archival purposes. Online: http://www.historicalvoices.org/papers/audio_digitization.pdf

SIL International. 2002. CC (Consistent Changes), version 8.1.5. Online: http://www.sil.org/computing/catalog/show_software.asp?id=4

Simons, Gary. 2004. Ensuring that digital data last: The priority of archival form over working form and presentation form. SIL Electronic Working Paper 2004-???. Online: Incomplete draft

Thompson, Henry. 2004. The Extensible Stylesheet Language Family (XSL). World Wide Web Consortium. Online: http://www.w3.org/Style/XSL/