Portability, Modularity and Seamless Speech-Corpus Indexing and Retrieval: A New Software for Documenting (not only) the Endangered Formosan Aboriginal Languages
Jozsef Szakos, Providence University
Ulrike Glavitsch, Federal Institute of Technology (ETH)

SpeechIndexer is new software that addresses the problem of documenting languages and sharing collected materials in the revival of endangered languages. The rapid disappearance of Austronesian languages in Taiwan and the urgent need for their revival call for software that is easy to use in the field, compatible with other systems, and able to index a first broad transcription of speech against its unsegmented, searchable digitized recording.

SpeechIndexer has two versions, one for the preparation of data and one for searching and sharing the database. The researcher correlates the transcribed morphemes with highlighted sections of the authentic audio recording and creates indices. He or she can then string-search the database by morphemes, grammatical tags, etc., depending on the indices prepared. One advantage of SpeechIndexer is its flexibility: the user can seamlessly define the length of context of the retrieved speech, use it for practice in learning, or save copies for further analysis.

Upward compatibility is guaranteed, since the database results are on CD and the software runs in the Java environment, which will be around for some time to come.

Portability follows from the small file size of the program itself and of the indices it generates. Since the original recordings are not modified during the process, the software can also serve as a means of direct archiving, and it can complement the mostly mainframe-based systems of most language documentation efforts (note 1).

1. Introduction - Corpora and Language Documentation  
The development of corpus linguistics in western countries was mainly based on research needs arising from written language forms [6]. "Father Busa was primarily concerned with his monumental work, the Index Thomisticus, a complete concordance of the works of St. Thomas Aquinas. He began the work of analyzing and collating the Thomistic corpus using electromechanical card-sorting machines in 1949. Father Busa pioneered many of the techniques required to encode a complex textual corpus to produce a comprehensive, analytical, contextual concordance." (note 2) These early linguistic needs included concordances of philosophical, biblical and literary works, and the major tasks of programmers consisted in creating KWIC (keyword in context) concordancers (Micro-OCP (note 3), Monoconc (note 4), Paraconc (note 5)), search engines and statistical search programs [4]. By the end of the 20th century the growth of computing power resulted in large corpora (BNC (note 6), etc.), which by then needed grammatical taggers and more complex search interfaces (SARA (note 7)) [1]. All the languages involved in early corpus linguistics had been standardized over centuries and were dominated by philological research; they lacked the diversity offered by the documentation of languages and dialects without established orthographic systems.

While our main aim is to raise the unwritten languages to the level of well-documented languages of classical/dominant civilizations, we may learn from their experiences and complement these corpora. Such national collections provide balanced records of the language of a certain period. They aim to be representative, but a language documentation should try to be complete and comprehensive, too.

Since standardization also includes consistent character encoding (Unicode (note 8)), research was able to develop in the direction of markup structures (XML (note 9), TEI (note 10)). This technology has so far evaded the problem of representing living speech in corpora. The Internet has comprehensive voice representation, but it needs thorough adaptation for corpus and documentation purposes (note 11). Although all these corpora contain transcribed speech and dialogues, and sometimes even the corresponding recordings are available, there is hardly any publicly available speech data that can be KWIC-searched analogously to written records. There are great projects which have solved the problem of including speech data, but they also have their constraints. A good example is the CHILDES (note 12) project with its CLAN software. As long as the user is on the net and accepts the limits of the software, it serves the linguist well, but its complexity may go beyond the means of a field linguist or a language educator. The problems of moving the manipulation of speech data onto a home PC include computer coding and the CPU power needed for fast processing of sound. No field linguist, and no general linguist either, would however doubt the importance of living speech for grammatical research and for the preservation of languages.

There has always been a felt need for a technology through which spoken language can be included in documentation and teaching. One result of this is the IPA, which more than a hundred years ago started a quiet revolution in language description. The appearance of tape recorders about fifty years ago was a more visible and audible revolution, leading to language labs and audiovisual methods. The problem we face now is how to harness the power of computers for a qualitative jump in language archiving that would capture the whole world of the target language for posterity.

The computer synthesis of speech and the efforts at speech processing, understanding and automatic analysis remain unattained goals of many research centers around the world, but they show that there is a need to find a bridge between the phonetic (physical) and reduced (written, coded) forms of language communication.

While most professional linguists are "speechlessly" facing the disappearance of languages and of their last speakers, being able to capture the diversity and richness of human vocal expression would presumably help computers to be tuned for the automatic recognition of speech. We would also be able to reliably record distant languages and make them available to the whole linguistic community, not only to those corporations that can afford the price, as with the LDC (note 13).

However, we have usually had to stop at the limits of keyboard-coded language. Whenever recorded sound comes into the language classroom, it is "frozen" and "segmented". The average linguist or field-worker does not have easy control of, or seamless access to, the authentic recorded data.

2. Speech Corpora - Linking Methods  
It will still be a long time until computers can process speech as fast as they now handle written data. Until then, and possibly as a supplement afterwards, we need some way of linking the speed of text searching with the richness of the speaker's voice.

One possible method of achieving correspondence is to segment the sound files and link the respective forms in HTML format. This is a suitable method where one has the time to do the transcription, verification and selection manually. It would even be possible to do some corpus research on such data, obtaining a ready-made segmented element at each search. There are examples of mark-up which follow this method (Academia Sinica Corpus (note 14), Formosan languages). It is also possible to circumvent the hindrances of segmenting if different segment sizes of the same speech are provided in the result. This leads to a repetition of data (word, phrase, sentence, paragraph, story), and the greatest drawback is that the researcher has already superimposed (consciously or unwittingly) a linguistic theory in the segmentation, which renders these recordings unusable for further analysis: they lose their originality through the segmentation. This method may be called "archiving" by its proponents, but it is like slicing up old parchments into slips, tagging them, putting them onto shelves and calling this an archive, just because the slices or the tags are easily available and searchable.

In making speech analyzable as a corpus, we need to keep its unsegmented authenticity, so we needed a computer program that gives quick, systematic access to any part of the voice recording. At the same time, the user should be able to define the length of the segment needed for further processing, analysis and comparison (e.g. intonation studies). We therefore developed the idea of the SpeechIndexer and SpeechFinder programs, which would help different linguists in the following ways:
  • Field researchers can index a raw, broad transcription of their recordings and go back to fine details through string search at any later time.
  • Phoneticians and dialectologists can create different indices to the same recording and compare them at any later time.
  • Language teachers and language learners can use authentic data of major languages, but also of disappearing languages, with easy access to the authentic original recording.
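
The requirement that the user, not the archive, decides how much speech surrounds a retrieved item can be illustrated with a small sketch. It is not taken from the SpeechIndexer sources: the class and method names are hypothetical, and it assumes that audio positions are counted in sample frames, which the paper does not specify. Because the recording stays unsegmented, widening the context is only arithmetic on two positions.

```java
// Hypothetical sketch: names and the frame-based unit are assumptions, not SpeechIndexer's API.
final class ContextSketch {

    /** A span of the unsegmented recording, in sample frames (end exclusive). */
    record AudioRange(long start, long end) {}

    /** Converts a context length in seconds to frames for a given sample rate. */
    static long secondsToFrames(double seconds, float sampleRate) {
        return Math.round(seconds * sampleRate);
    }

    /**
     * Widens an indexed span by the same amount of context on both sides,
     * clamped to the bounds of the recording so the result is always playable.
     */
    static AudioRange withContext(AudioRange hit, long contextFrames, long totalFrames) {
        long start = Math.max(0, hit.start() - contextFrames);
        long end = Math.min(totalFrames, hit.end() + contextFrames);
        return new AudioRange(start, end);
    }
}
```

The same indexed word can thus be played alone, within its phrase, or within a longer stretch simply by passing a different context length; no second copy of the audio is needed.
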
3. Technical Description and Examples of Indexing  
The programs SpeechIndexer and SpeechFinder are written in Java (note 15) and can be executed on almost any platform [5]. SpeechIndexer is intended for use by scientists and offers the full functionality, whereas SpeechFinder is to be used in language education and training and has a reduced set of functions. The full function set of SpeechIndexer is described below.

SpeechIndexer starts by presenting a text editor to the user. The text editor window is the main window of the program. The user loads the transcribed text of an authentic recording into the editor. Then, he loads the corresponding audio recording. The first portion of the audio recording is displayed as a digitized signal below the text editor in a separate window - the signal window. A set of tool buttons allows manipulations on the audio signal such as moving the signal to the right and to the left, zooming it in and out, saving a segment of the audio signal to a separate file, and playing a marked segment of the displayed audio signal. A section of the audio signal can be marked by setting start and end positions in the displayed signal by mouse clicks. The section of the audio signal between the selected positions is highlighted as a result. Fig. 1 shows the main window and the signal window where a segment of the audio signal has been marked.

Correspondences between a portion of text and the corresponding audio segment, called indices in the following, are created as follows. The user selects the text, i.e. the words or sequences of words of interest, in the text window and finds the corresponding audio section in the signal window. Then the start and end positions of the audio section are marked as precisely as possible; the user can check the correctness of the marked section by playing it. Start and end positions are set correctly if the played audio section contains exactly the marked words in the text. The index between the marked text and the marked audio segment is created by selecting the menu item Index->set or by pressing a key shortcut. As a result, the marked text is underlined to indicate that a reference from this portion of text to the corresponding section of the audio file exists. Each time the user clicks on a text segment that has an index, the corresponding audio section is played. In addition, the audio section is shown as marked in the signal window. This allows the user to easily check and correct created indices. If the audio section of an index is found to be too short or too long, the index is cleared and then recreated correctly. Fig. 2 in the Appendix shows how an index is created for a word selected in the main window and an audio section marked in the signal window. Fig. 3 shows how indices are represented in the main window.
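
The click-to-play behaviour described above amounts to looking up which index, if any, covers the clicked character. The following is a minimal sketch of that lookup under stated assumptions: the names IndexLookupSketch, SpeechIndex and indexAt are hypothetical, the text span is treated as end-exclusive, and audio positions are assumed to be sample frames. It is an illustration, not the program's actual code.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch: the paper does not publish SpeechIndexer's classes.
final class IndexLookupSketch {

    /** One index: a character span in the transcript tied to a span of the audio file. */
    record SpeechIndex(int textStart, int textEnd, long audioStart, long audioEnd) {
        SpeechIndex {
            // Reject empty or negative spans so every index is playable.
            if (textStart < 0 || textEnd <= textStart || audioStart < 0 || audioEnd <= audioStart) {
                throw new IllegalArgumentException("empty or negative span");
            }
        }

        /** True if the character offset lies inside the underlined text (end exclusive). */
        boolean coversChar(int offset) {
            return offset >= textStart && offset < textEnd;
        }
    }

    /** Returns the first index covering the clicked character offset, if there is one. */
    static Optional<SpeechIndex> indexAt(List<SpeechIndex> indices, int clickedOffset) {
        return indices.stream().filter(ix -> ix.coversChar(clickedOffset)).findFirst();
    }
}
```

A linear scan is used for brevity; for very long transcripts a sorted structure would be a natural refinement, but the paper gives no indication of which approach the program takes.
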

The created indices are stored in a separate file. The user selects the menu item Indices->saveAs and is prompted to enter a file name. The program requires that the entered file name has the extension '.si' to clearly distinguish index files from other files. After the user has entered a valid file name for the indices, the file name appears in the title bar of the main window and the indices created so far are stored under this file name. The program also stores the name of the text file and the name of the authentic audio recording together with the indices. Note that the program stores only four items for each index: the starting character position in the text, the end character position in the text, the starting position in the audio file and the end position in the audio file. Thus, the storage required for index files, even those holding a large number of indices, is very limited. The user may create several index files for the same pair of text and audio file. This allows the user to emphasize different aspects of the authentic recording using different index files. Similarly, it is possible to load different index files in the course of a session. In order to load an index file, both a text file and an audio file need to be loaded in advance. When an index file is loaded, the program checks whether the text file name stored in the index file matches the file name of the loaded text file. Similarly, it checks whether the stored audio file name is equal to that of the loaded audio file. The program reports an error if a mismatch is detected. If no mismatch has been detected, it removes previously existing indices and loads the indices of the new index file.
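
The content of an index file and the checks made on loading can be modelled in a few lines. This sketch deliberately does not guess at the on-disk layout of '.si' files, which the paper does not describe; it only mirrors the stated facts (two file names plus four numbers per index, an extension check, a name check on load). All identifiers are hypothetical, and the size estimate assumes a plain binary encoding with 4-byte text positions and 8-byte audio positions.

```java
import java.util.List;

// Hypothetical in-memory model; the real '.si' file layout is not documented in the paper.
final class IndexFileSketch {

    /** The four items stored per index. */
    record Entry(int textStart, int textEnd, long audioStart, long audioEnd) {}

    /** An index file: the names of the text and audio files it belongs to, plus its entries. */
    record IndexFile(String textFileName, String audioFileName, List<Entry> entries) {}

    /** Index files must carry the '.si' extension and a non-empty base name. */
    static boolean hasIndexExtension(String fileName) {
        return fileName.length() > 3 && fileName.endsWith(".si");
    }

    /**
     * Returns the entries that replace the current ones, after checking that the
     * index file was made for the text and audio files that are currently loaded.
     */
    static List<Entry> load(IndexFile file, String loadedTextName, String loadedAudioName) {
        if (!file.textFileName().equals(loadedTextName)) {
            throw new IllegalStateException("index file refers to text file " + file.textFileName());
        }
        if (!file.audioFileName().equals(loadedAudioName)) {
            throw new IllegalStateException("index file refers to audio file " + file.audioFileName());
        }
        return List.copyOf(file.entries());
    }

    /** Rough size of the index data alone under the assumed encoding: 24 bytes per index. */
    static long estimatedPayloadBytes(int entryCount) {
        return (long) entryCount * (2 * Integer.BYTES + 2 * Long.BYTES);
    }
}
```

Under these assumptions ten thousand indices would need roughly 240 KB, which is negligible next to the audio recording they point into and consistent with the claim that index files stay small.
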

Another important function is the ability to search for strings in the text. The occurrences of the string are highlighted and the user may play them by mouse click if an index exists. This way, the user may compare the pronunciation of the same word or the same sequence of words in different contexts. Fig. 4 in the Appendix shows the use of the search function. The user enters the search string in a separate window from where he can always find the next occurrence of the search string relative to the current position.
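
The search function combines two steps: finding the next occurrence of a string relative to the current position, and checking whether an index makes that occurrence playable. A possible sketch follows; the names are hypothetical, the search is a plain case-sensitive substring match (the paper does not say whether the program offers more), and an occurrence is treated as playable when any index overlaps it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of search plus index lookup; not SpeechIndexer's actual code.
final class SearchSketch {

    /** A text span (end exclusive) tied to a span of the audio recording. */
    record Span(int textStart, int textEnd, long audioStart, long audioEnd) {}

    /** Offset of the next occurrence at or after fromOffset, or -1 if there is none. */
    static int findNext(String text, String query, int fromOffset) {
        if (query.isEmpty()) {
            return -1; // an empty query would match everywhere
        }
        return text.indexOf(query, Math.max(0, fromOffset));
    }

    /** Offsets of all non-overlapping occurrences, in text order. */
    static List<Integer> findAll(String text, String query) {
        List<Integer> hits = new ArrayList<>();
        int at = findNext(text, query, 0);
        while (at >= 0) {
            hits.add(at);
            at = findNext(text, query, at + query.length());
        }
        return hits;
    }

    /** The first index overlapping the occurrence, if any; only such occurrences can be played. */
    static Optional<Span> playableFor(List<Span> indices, int hitStart, int hitLength) {
        int hitEnd = hitStart + hitLength;
        return indices.stream()
                .filter(s -> s.textStart() < hitEnd && hitStart < s.textEnd())
                .findFirst();
    }
}
```

Collecting all occurrences together with their audio spans gives the spoken counterpart of a KWIC concordance: each hit can be listened to in its own context and the pronunciations compared.
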

As mentioned above, the SpeechFinder program contains a subset of the functions described above. Users of SpeechFinder will work with existing index files. The signal window is not displayed in SpeechFinder and prospective users can neither create nor modify indices. This makes SpeechFinder particularly suited for language training where both teachers and students can study the language from authentic recordings [6].

Both programs, SpeechIndexer and SpeechFinder, are very compact in their current form. Each occupies less than 100 KB of secondary storage.

4. Applying SpeechIndexer to Austronesian Aboriginal Languages  
Since one of the authors has devoted the past decades of his life to the documentation of Formosan aboriginal languages, it is understandable that the first application and testing of these programs are being done with these recordings.

There are still 15 Austronesian languages spoken in the mountains of Taiwan (comprising 40 dialects), two thirds of them disappearing during our time. The tribes were colonized by Japanese rulers during the first half of the past century and have been further sinicized under Chinese rule.

The languages chosen for illustration are Saaroa, Kanakanavu and Tsou. The first two are spoken by only a dozen elderly people, while Tsou still "boasts" about 2000 fluent speakers.

There are about 350 hours of recorded Kanakanavu materials available as a corpus. They are partially transcribed and we aim at the creation of a hundred hours of selected speech materials in SpeechIndexer format.

There are almost 900 hours of Tsou materials, and we hope to mark up about 300 hours with SpeechIndexer.

We have also created about 500 hours of Saaroa speech recordings, out of which we intend to choose the best 150 hours to be indexed and made available in the near future.

Children are beginning to learn the language at school and there are revitalization efforts going on.

For some of the Formosan languages it will not be possible to make hundreds of hours of recordings, but we hope to complete at least well-annotated examples of DVD size for each one.

Textbooks and myth collections in these languages are under preparation. The samples in our paper are taken from these stories.

5. Future Tasks in Software Development  
  • Creating an editor for index files and establishing a standard for integration in word-voice processing applications: this will make corrections easy to perform once the HTML or the index file has been modified.
  • Concatenation of sound files for search purposes: For practical searches we are thinking of linking as much sound as can easily be held on a DVD. This should be enough for writing reference grammars of languages, as well as for making middle-level textbooks.
  • Integrating SpeechIndexer in database applications which ensure the multiple indexing of files.
  • Cross-referencing and annotation functions in the index file, ultimately helping to cross-reference the voice database.
  • Integrating SpeechIndexer in concordancing programs, where other aspects of the corpus could be researched, accompanied by the living voice.

1. Consult e.g. the software tools of DOBES: http://www.mpi.nl/tools/elan.html
2. http://www.smu.edu/bridwell/publications/ryrie_catalog/xiii_1.htm
3. http://users.ox.ac.uk/~ctitext2/resguide/resources/o125.html
4. http://www.ruf.rice.edu/~barlow/mono.html
5. http://www.ruf.rice.edu/~barlow/parac.html
6. http://www.natcorp.ox.ac.uk/
7. http://www.natcorp.ox.ac.uk/SARA/index.htm
8. http://www.unicode.org/
9. http://www.w3.org/XML/
10. http://www.tei-c.org/
11. http://www.w3.org/Voice/ and http://www.w3.org/TR/voicexml21/
12. http://childes.psy.cmu.edu/ and http://childes.psy.cmu.edu/manuals/CLAN.pdf
13. http://www.ldc.upenn.edu/ and http://www.ldc.upenn.edu/annotation/specom.html
14. http://www.ling.sinica.edu.tw/formosan/
15. http://java.sun.com/

[1] Aston, G. and Burnard, L. 1998. The BNC Handbook. Edinburgh: Edinburgh University Press.

[2] Barnbrook, G. 1996. Language and Computers. Edinburgh: Edinburgh University Press.

[3] Biber, D., Conrad, S., and Reppen, R. 1998. Corpus Linguistics: Investigating Language Structure and Use. Cambridge: Cambridge University Press.

[4] Hockey, S. 1988. Micro-OCP User Manual. Oxford: Oxford University Press.

[5] Mason, O. 2000. Programming for Corpus Linguistics: How to do text analysis with Java. Edinburgh: Edinburgh University Press.

[6] McEnery, T. and Wilson, A. 1996. Corpus Linguistics. Edinburgh: Edinburgh University Press.

[7] Thomas, J. and Short, M. (eds.) 1996. Using Corpora for Language Research. London: Longman.

[8] Wichmann, A., Fligelstone, S., McEnery, T., and Knowles, G. (eds.) 1997. Teaching and Language Corpora. London: Longman.

A. Appendix  
Fig. 1: Creating an index between a piece of text and the corresponding audio segment.

Fig. 2: Creating an index between a marked portion of text and a section of the audio signal.

Fig. 3: Transcribed text with indices (references to audio segments) denoted by underlines.

Fig. 4: Searching for a string in a text that has been fully indexed. The found occurrence of the search string is highlighted (the first occurrence of 'kuri' on the third line that contains Latin letters); upon clicking on the highlighted string, the audio segment of the index is played and shown in the signal window.

