Josef Szakos, Providence U &
Ulrike Glavitsch, Swiss Federal Institute of Technology

Portability, modularity and seamless speech-corpus indexing and retrieval: A new software for documenting (not only) the endangered Formosan aboriginal languages

This paper introduces the SpeechIndexer software interface as applied to two linked problems: documenting endangered languages and using the documented materials to prepare grammars and textbooks. The special situation of these Austronesian languages, more than a dozen of them, and the urgent need for their revival call for software that provides fast indexing between a first broad transcription of speech and its unsegmented, searchable digitized recording. The diversity of research circumstances does not favor online databases. It calls instead for small, user-friendly software that creates indexed but otherwise unaltered files, which can be further edited or indexed after the researcher returns from the field.

In combination with other corpus software (OCP, Monoconc), we are building databases of hundreds of hours of unsegmented natural speech in the Formosan languages. Our data collection and analysis have so far involved the Tsou, Kanakanavu, and Saaroa languages, spoken by aboriginal communities in the mountainous regions of the island. In our presentation, we give a detailed account of the technological problems these communities face in their language maintenance efforts. These previously unwritten languages require a system that preserves the primacy of spoken language, with all its peculiarities, for teaching purposes, all the more so because the elderly speakers will soon pass away.

SpeechIndexer is being developed to respond to the above requirements. The software has two versions: one for preparing and entering data, and one for searching and using the database. The researcher loads the field transcription of speech (in HTML format) in one window and opens the .wav file of the recording in another. The researcher then marks a stretch of the continuous sound file, which is usually over one hour long, and links it to the corresponding transcribed words, so that the index is set over the marked stretch. The completed database can then be searched by morphemes, grammatical tags, etc., on the basis of the raw transcription. The most useful feature of retrieval is its flexibility for the user, whether grammar writer or learner: users can seamlessly define the length of the context of the retrieved speech, prepare numerous searchable indices, practice for learning purposes, or save copies for further analysis. This capability also makes our interface very useful for learners who want to learn 'directly' from the most authentic recordings.
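The indexing and retrieval steps described above can be pictured with a minimal sketch. It is an illustration under stated assumptions, not SpeechIndexer's actual code or file format: the class `SpeechIndexSketch`, its methods, the millisecond time spans, and the placeholder tokens are all hypothetical. Each entry links one transcribed word to a marked stretch of the recording, a search matches fragments of the transcription, and the retrieved stretch is widened by a user-chosen amount of context.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only: these names and this structure are assumptions,
// not SpeechIndexer's real classes or index format.
final class SpeechIndexSketch {

    /** One index entry: a transcribed word linked to a marked stretch of sound. */
    static final class Entry {
        final String word;
        final long startMs;
        final long endMs;

        Entry(String word, long startMs, long endMs) {
            this.word = word;
            this.startMs = startMs;
            this.endMs = endMs;
        }
    }

    private final List<Entry> entries = new ArrayList<>();
    private final long recordingLengthMs;

    SpeechIndexSketch(long recordingLengthMs) {
        this.recordingLengthMs = recordingLengthMs;
    }

    /** Links a transcribed word to the marked stretch [startMs, endMs]. */
    void add(String word, long startMs, long endMs) {
        if (startMs < 0 || endMs < startMs || endMs > recordingLengthMs) {
            throw new IllegalArgumentException("marked stretch lies outside the recording");
        }
        entries.add(new Entry(word, startMs, endMs));
    }

    /** Returns every entry whose transcription contains the given fragment. */
    List<Entry> search(String fragment) {
        List<Entry> hits = new ArrayList<>();
        for (Entry e : entries) {
            if (e.word.contains(fragment)) {
                hits.add(e);
            }
        }
        return hits;
    }

    /** Widens a hit by contextMs on both sides, clamped to the recording bounds. */
    long[] withContext(Entry e, long contextMs) {
        long start = Math.max(0L, e.startMs - contextMs);
        long end = Math.min(recordingLengthMs, e.endMs + contextMs);
        return new long[] {start, end};
    }

    public static void main(String[] args) {
        // A one-hour recording; the tokens are placeholders, not real language data.
        SpeechIndexSketch index = new SpeechIndexSketch(3_600_000L);
        index.add("tok-a", 1_200L, 1_850L);
        index.add("tok-b", 1_900L, 2_400L);

        for (Entry hit : index.search("tok-a")) {
            long[] span = index.withContext(hit, 2_000L);
            System.out.println(hit.word + ": play " + span[0] + " to " + span[1] + " ms");
        }
    }
}
```

Because the entries only point into the recording, the sound file itself stays unsegmented and unaltered, and the context length can be chosen anew for every retrieval.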
Because the resulting databases are distributed on CD and DVD and run within the Java environment, our solution is portable and may complement the mostly mainframe-based systems of most language documentation efforts (SOAS, DOBES). By presenting this highly compatible software technology, from its inception in Switzerland to its application to Formosan languages, we also aim to make it available to researchers and teachers of other languages, including minority languages, and to contribute to the revival of these cultures. We also hope to gain new input from other endangered language communities, so as to increase the versatility of our solution to problems of spoken language documentation and analysis.