Taiwan's NDAP Language Archives Project: From bronze inscription texts to Austronesian field recording
Cui-xia Weng*, Ru-yng Chang*, Elizabeth Zeitoun*, Chao-jung Chen*, Derming Juang*, Chu-ren Huang*, and Chin-chuan Cheng#

*Academia Sinica and #City University of Hong Kong

0 Abstract  
The Language Archives Project is part of Taiwan's National Digital Archives Program (NDAP). The project digitizes and archives a wide range of linguistic data, from heritage texts to endangered Formosan languages. The goal is two-fold: both to preserve unique cultural heritages and to provide a comprehensive linguistic infrastructure to support content interpretation of archives. Based on these two goals, the main challenges of this project are: to provide versatile yet uniform presentation of different text types, to account for language change, and to account for language variation.

We take two archives of contrasting characteristics to illustrate how these challenges are met. The Bronze Inscription archives deal with an archaic language preserved in a written form that is significantly different from Modern Chinese writing. The Formosan (i.e. Taiwan Austronesian) archives deal with indigenous languages that are endangered and have no written conventions. We show how OLACMS lays the common ground for content documentation of these contrasting archives.

First, for the Bronze Inscription Archives, the fundamental issue is how to represent the archaic inscribed written form while establishing direct correspondences with modern writing systems. We adopt the Intelligent Character System to deal with this issue. Although glyphs vary greatly, the composition of Chinese characters from basic glyphs remains regular. Hence a system based on the composition of basic glyphs will not only help with diachronic Chinese archives but can also deal with cross-lingual variations (e.g. Korean and Japanese Kanji, new characters from Hong Kong, etc.).

Second, the Formosan languages are indigenous languages in Taiwan that are also thought to be close to the common ancestor of Austronesian languages. The first issue we face is that of establishing orthography, which is solved by the common use of IPA among field linguists. The second issue involves establishing segmentation and tagging standards. The third issue involves audio-representation of field recording. And the last issue involves mapping the lexicon to GIS (geographic information system) to represent language variations and contrasts.

1.0 Introduction  
The Language Archives Project is part of Taiwan's five-year National Digital Archives Program (NDAP), which was launched in 2002. The NDAP Language Archives Project, carried out primarily at Academia Sinica, digitizes and archives a wide range of linguistic data, from heritage texts to endangered Formosan languages. Given these diverse data types, the two main challenges for this project are how to digitize and annotate data properly, and how to provide a versatile yet uniform presentation that accounts for language change and language variation. The project has two goals: to preserve unique cultural heritages and to provide a comprehensive linguistic infrastructure to support content interpretation of archives.

In this paper, we first briefly describe each sub-project. We then discuss how OLACMS lays the common ground for content documentation of these contrasting archives. The more detailed discussion that follows focuses on two archives of contrasting characteristics to illustrate how we meet the challenges mentioned above. The Bronze Inscription archives deal with an archaic language preserved in a written form that is significantly different from Modern Chinese writing; we take examples from this archive to show how the missing character problem is solved in our project. The Formosan (i.e. Taiwan Austronesian) archives deal with indigenous languages that are endangered and have no written conventions; we offer them as an example of how to create a multimodal archive for endangered languages. The paper ends with a short conclusion.

2.0 Organization of the NDAP Language Archives Project  
The Language Archives Project has two branch projects, one on Chinese and one on Formosan language archives. The former is further divided into five sub-projects, which represent different usages and historical periods of Chinese. The Formosan Language Archives project aims to preserve the endangered Formosan Austronesian languages with corpora, lexicons and grammars of each language.

The Formosan languages are Austronesian languages. There is great diversity and complexity among these indigenous languages spoken in Taiwan. Unfortunately, most of them are endangered. Hence, we aim not only to preserve their linguistic data and structures, but also to preserve some of their cultural heritage through the preservation of the languages. This is why audio storytelling and mapping to GIS are used. More details on our approaches and experience are given in a later section.

Among the Chinese archives, "Early Mandarin Chinese Lexicon" is designed as part of the Lexical Knowledgebase (LKB) tracing the historical changes of the Chinese language. The LKB will contain a series of synchronic lexicons from Pre-Qin to Modern Mandarin. The archived materials include written records of lectures, documents of laws and decrees, and fiction and drama of the Ming-Qing period.

The "Lexicon of Pre-Qin Bronze Inscriptions and Bamboo Scripts (LBB)" project aims to build a lexicon of Yin, Zhou, and Chun Qiu bronze inscriptions (from 13th century through 3rd century BC), and the bamboo manuscripts of the Warring States (475BC-221BC). This will be one of the earliest lexicons in the LKB series on the evolution of the Chinese language. For a long time, manual copying and the reproduction of rubbings were the two ways to preserve archaic written languages. A lexical database that preserves these characteristic ideograms would be a great advance in the archiving of ancient Chinese culture. The first difficulty to overcome in developing this kind of database is the missing character problem in computers. This project has adopted the Intelligent Character System to solve this problem. We describe this system in Section 4.

The "Modern Chinese Corpus and Treebank" will complete a 10 million-word tagged and balanced corpus for modern Mandarin, as well as complete a grammatically annotated treebank. The emphasis will be on value-added applications such as information search, retrieval, automatic Q & A, and summarization.

The "New Age Corpus: Linguistic Representations and Archives of Multimedia Data" project documents the everyday usage of modern Chinese, such as oral communication, discussion topics, lexicons, gestures and facial expressions, in Taiwan in digital multimedia forms.

The "Southern-Min Archive: A Database of Historical Change in Language Distribution" project is a new addition in 2003. It aims to provide both a historical depth and sociological variation to the archives of Chinese languages in Taiwan.

All subsidiary projects of the Language Archives Project are listed below for easier reference. The three main components of the Linguistics Anchoring project are also included. The Linguistics Anchoring project is an NDAP technology research and development project. Its goal is to provide the infrastructure for language-based knowledge processing and management. The anchoring reference of this project will turn the contents of the language archives into interoperable information. The website is at http://LingAnchor.sinica.edu.tw/

Table 1: All subsidiary projects of the Language Archives Project, and the Linguistics Anchoring Project

Language Archives Project
  1. Chinese Languages Archives and Structure
    1. Early-Modern Chinese Lexicon
    2. Lexicon of Pre-Qin Bronze Inscriptions and Bamboo Scripts (LBB)
    3. Modern Chinese Corpus and Treebank
    4. New Age Corpus: Linguistic Representations and Archives of Multimedia Data
    5. Southern-Min Archive: A Database of Historical Change in Language Distribution
  2. Formosan Language Archives
Linguistics Anchoring Project
  1. Linguistic anchoring
  2. Technological support
  3. International standards, such as OLAC, ISLE and ISO etc.

Diagram 1, using the numeral and alphabetical designation of each subsidiary project given above, illustrates the functional structure of the Language Archives Project. The green column at the center represents the standards and tools supporting the digitizing and archiving of language data. Each of the peripheral circles extended from the column represents a language group or variety. The diagram shows how a sharable and reusable set of technologies can be used to support a wide range of language archives. This is exactly the design feature of OLAC, which we adopt and will discuss in the following section.

Diagram 1: Functional Structure of the Language Archives Project

3.0 The Application of OLAC to the NDAP Language Archives Project  
The Open Language Archives Community (OLAC) is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources. Three primary standards serve to bridge the gaps that now lie between language resources and users: (1) OLACMS: the OLAC Metadata Set (Qualified DC, Dublin Core), (2) OLAC MHP: refinements to the OAI (Open Archives Initiative) protocol, and (3) OLAC Process: a procedure for identifying Best Common Practice Recommendations. In December 2002 an OLAC workshop (IRCS Workshop on Open Language Archives) was held in Philadelphia; it revised the OLAC standards and controlled vocabularies, reviewed OLAC archives and services, and considered proposals for new activities. The metadata format, OLAC extensions, and the procedures for defining and documenting a third-party extension are described in OLAC Metadata version 1.0.

The NDAP language archives plan to be OLAC compliant. Three of the resultant archives are already registered with the OLAC repository: Academia Sinica Balanced Corpus of Modern Chinese, Academia Sinica Formosan Language Archive, and Academia Sinica Tagged Corpus of Early Mandarin Chinese. These resources have also been registered at the OLAC-compliant Asian Language Resources Repository (hosted by Tokyo Institute of Technology, yet to be released). Chang and Huang (2002) reported on the application of OLACMS to the Language Archives Project. They found that OLACMS provides a solid basis which, with extensions and elaborations, allows productive and in-depth description of our archives. The additional information that we need is Temporal and Geographic Location, as well as textual information such as style, mode, genre, and medium. The suggested additions and elaborations are discussed in Sections 3.1-3.4.

3.1 Temporal and Geographic Location  
Since China used a different calendar system until the early 20th century, the temporal descriptions of inherited Chinese archives do not conform to the current DC standard. The Chinese calendar sub-type will therefore include time, dynasty name, state name, and emperor's reign. We may also add other chronological methods, such as the lunar or solar calendar. Take the Academia Sinica Ancient Chinese Corpus for example. Its coverage is Early Mandarin Chinese. Users will be able to refer to a historical calendar and find that this period corresponds to the Yuan, Ming, and Qing dynasties. They will also be able to convert the dates to the Western calendar using the conversion table provided by Academia Sinica, which covers the past 2000 years.
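The sub-types listed above can be pictured as a small record. The following Python sketch is only an illustration: the class name, field names, and the dynasty year ranges (commonly cited approximate values, not taken from the Academia Sinica conversion table) are assumptions, and the sketch is not the project's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

# Approximate Western-calendar spans for the three dynasties named in the text.
# Assumption: commonly cited values, not read from the Academia Sinica table.
DYNASTY_SPANS = {
    "Yuan": (1271, 1368),
    "Ming": (1368, 1644),
    "Qing": (1644, 1912),
}


@dataclass
class ChineseCalendarCoverage:
    """Hypothetical temporal-coverage record using the sub-types named above."""
    dynasty: str
    state: Optional[str] = None        # state name, where relevant
    reign: Optional[str] = None        # emperor's reign title
    reign_year: Optional[int] = None   # year within the reign

    def western_span(self):
        """Return the (start, end) Western years of the dynasty."""
        return DYNASTY_SPANS[self.dynasty]


def overall_span(coverages):
    """Combine several coverage records into one Western-calendar span."""
    starts, ends = zip(*(c.western_span() for c in coverages))
    return min(starts), max(ends)


# Early Mandarin Chinese, described by the dynasties it covers.
early_mandarin = [ChineseCalendarCoverage(d) for d in ("Yuan", "Ming", "Qing")]
print(overall_span(early_mandarin))
```

A finer-grained conversion (a given year of a given reign) would need the full conversion table and is left out here.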

When Coverage has a spatial refinement, a location can have different names because of the unit used in cataloguing, as well as because of temporal and linguistic variations. When describing spatial coverage, we need to know more than a place name. For example, Washington State is different from Washington D.C., and Taipei City is different from Taipei County. Hence we need to define sub-types of spatial description that include Continent, Country, Administrative Division, Longitude, Latitude, Address, etc.
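As a rough illustration of why a bare place name is not enough, the spatial sub-types can be held in one record so that two places sharing a name remain distinct. The class and field names below are assumptions made for this sketch, not the project's metadata schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpatialCoverage:
    """Hypothetical spatial-coverage record with the sub-types named above."""
    place_name: str
    continent: Optional[str] = None
    country: Optional[str] = None
    administrative_division: Optional[str] = None
    longitude: Optional[float] = None
    latitude: Optional[float] = None
    address: Optional[str] = None


taipei_city = SpatialCoverage(
    "Taipei", continent="Asia", country="Taiwan", administrative_division="City"
)
taipei_county = SpatialCoverage(
    "Taipei", continent="Asia", country="Taiwan", administrative_division="County"
)

# Same place name, but the records differ once the division is recorded.
print(taipei_city.place_name == taipei_county.place_name)
print(taipei_city == taipei_county)
```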

3.2 Mode and Genre  
Each text in the Academia Sinica Balanced Corpus of Modern Chinese (Sinica Corpus) is marked up with five textual parameters: Mode, Genre, Style, Topic and Medium. This is important textual information that needs to be catalogued in the metadata.

Table 2: The relation between Mode and Genre of Sinica Corpus (Chinese Knowledge Information Processing (CKIP) 1993)

3.3 Style  
There are four styles that are differentiated in Sinica Corpus: narrative, argumentative, expository, and descriptive.  
3.4 Medium  
Sinica Corpus specifies the media of the language resources as: Newspaper, General Magazine, Academic Journal, Textbook, Reference Book, Thesis, General Book, Audio/Visual Medium, Conversation/Interview.

Table 3: Topic of Sinica Corpus (CKIP 1993)

An example of the adoption follows, for a Sinica Corpus text with a Topic of Arts and a sub-topic of Music:

<topic xml:lang="x-sil-CHN">Art/Music</topic>
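An element of this shape can be generated programmatically. The sketch below uses Python's standard xml.etree.ElementTree; the helper name `topic_element` and the convention of joining topic and sub-topic with a slash are assumptions read off the single example above, not a documented part of OLACMS.

```python
import xml.etree.ElementTree as ET

# Namespace URI that ElementTree serializes with the reserved "xml" prefix.
XML_NS = "http://www.w3.org/XML/1998/namespace"


def topic_element(topic, subtopic=None, lang="x-sil-CHN"):
    """Build a <topic> element; topic and sub-topic are joined with '/'."""
    elem = ET.Element("topic")
    elem.set(f"{{{XML_NS}}}lang", lang)
    elem.text = f"{topic}/{subtopic}" if subtopic else topic
    return elem


serialized = ET.tostring(topic_element("Art", "Music"), encoding="unicode")
print(serialized)
```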

4.0 The Intelligent Character System  
The Intelligent Character System, developed by the Chinese Document Processing Lab of the Institute of Information Science at Academia Sinica, has four main parts: components, glyphs, operators and production rules. Components are the basic units of a glyph. Take ying2 for example. and are two components that compose , and further, can be decomposed into two components: and . Also, can further be divided into and . In this system each component is stored as a file, and each glyph is stored as a folder. Figure 1 gives the glyph structure of ying2 and its component directory.

Figure 1: The glyph structure of ying2

There are three basic operators to express the structure of a glyph: horizontal, vertical and contained composition. Again, take ying2 for example. The composition of and is horizontal composition; and is vertical one; and is contained within . Table 4 gives the production rules of ying2 to illustrate how glyphs are composed.

Table 4: Decomposition of ying2

Although glyphs vary greatly, they are composed from basic components essentially according to these three composition operators. Hence an encoding scheme based on the composition of basic components will not only help with diachronic Chinese archives but can also deal with cross-lingual variations in other writing systems that compose their characters from glyphs (e.g. Korean and Japanese Kanji, new glyphs from Hong Kong, etc.).
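The idea of describing a glyph as a tree of components joined by the three operators can be sketched in Python as follows. This is an illustration only, not the Intelligent Character System's own data format: the class names are hypothetical, and the example character (盟, analysed here as 明 over 皿, with 明 as 日 beside 月) is chosen for this sketch and does not come from the paper.

```python
from dataclasses import dataclass
from typing import Union

# The three composition operators named in the text.
HORIZONTAL, VERTICAL, CONTAINED = "horizontal", "vertical", "contained"


@dataclass(frozen=True)
class Component:
    """A basic component that is not decomposed further."""
    name: str


@dataclass(frozen=True)
class Glyph:
    """Two parts joined by one operator; each part may itself be a glyph."""
    operator: str
    first: "Node"
    second: "Node"


Node = Union[Component, Glyph]


def components(node):
    """List the basic components of a glyph, in composition order."""
    if isinstance(node, Component):
        return [node.name]
    return components(node.first) + components(node.second)


def expression(node):
    """Write the composition as a nested operator expression."""
    if isinstance(node, Component):
        return node.name
    return f"{node.operator}({expression(node.first)}, {expression(node.second)})"


ming = Glyph(HORIZONTAL, Component("日"), Component("月"))  # 明
meng = Glyph(VERTICAL, ming, Component("皿"))               # 盟
guo = Glyph(CONTAINED, Component("囗"), Component("玉"))    # 国

print(components(meng))
print(expression(meng))
print(expression(guo))
```

Because a character absent from the computer's character set can still be written as such an expression over known components, a representation of this kind gives a way to record and search it.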

4.2 User Interface  
The Intelligent Character System provides tools and a Hanzi (Chinese character) glyph database that let users browse and transform missing characters through a Java applet on a webpage. Users can also edit and browse missing characters in the Microsoft Office environment, and a user interface is provided to search and access missing characters. Diagram 2 charts the system's structure to show the interaction between the tools and the database.

Diagram 2: Database and tools for the Intelligent Character System

The basic idea of the Intelligent Character System comes from the traditional study of the "forms" of Chinese characters, especially knowledge about glyphs. With the assistance of modern technology, this system not only solves the missing character problem in the Lexicon of Pre-Qin Bronze Inscriptions and Bamboo Scripts project, but also improves the performance of existing Hanzi processing systems for possible future applications such as data sharing.

5.0 The Formosan Language Digital Archive  

The Formosan languages are indigenous languages in Taiwan that are also thought to be close to the common ancestor of Austronesian languages. According to linguistic studies, there are still 15 extant languages (Thao, Kavalan, Pazeh, Atayal, Saisiyat, Bunun, Tsou, Rukai, Paiwan, Puyuma, Amis, Seediq, Saaroa, Kanakanavu, Yami), but they are declining rapidly. So far, this project has built corpora for four of the six Rukai dialects (Mantauran, Tanan, Maga, and Tona), which can be browsed and searched via the internet. Others are being added to the archive progressively. It is hoped that by the end of this project at least nine Formosan languages will be archived.

The Formosan language archive, which can be browsed in both Chinese and English, contains three main types of databases: (1) corpora with annotated texts, (2) a language GIS (geographic information system), and (3) four bibliographical databases. These databases support a range of research uses and are briefly introduced below.

5.1 Linguistic Corpora  
The collection of the Formosan language corpora includes folktales, narratives, conversations, songs and elicited sentences. The last two categories are not yet available on the web. The structure of an annotated text comprises the transcription of the original language, divided into paragraphs and sentences; glosses; and free translations. IPA symbols are used to transcribe the collected texts, for two reasons: (1) there is no standardized writing system for Formosan languages, and (2) IPA is an international standard for transcribing sound recordings in other archive projects.

Glosses, on the other hand, can be provided at the word level (stems) or at the morphemic level (roots and affixes). For the Rukai corpus, morphemic analysis has been adopted for the annotations. The tags on each morpheme cover grammatical functions and lexical and syntactic categories. The grammatical function tags follow Formosan linguistics conventions; the tagset of abbreviations used in the corpora is shown in Table 5. The lexical category tags follow the CKIP standard (CKIP 1993), with some reservations. In addition to the annotations, each sentence can be played back as audio, which was recorded digitally and then converted to MP3 format. This audio representation allows users to download recorded sentences, to view and analyze sound spectrograms, and to process the sound data with sound editing software. Figure 2 gives an example of how a text is displayed on the web interface.

Table 5: Abbreviations of grammatical functions (for Rukai as a pilot study)

  Figure 2: Sentence Display
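The layered structure of an annotated sentence described in this section (transcription, morpheme-level glosses, free translation, and an audio file) can be pictured with a small data model. The Python sketch below is a minimal illustration under assumptions: the class and field names are hypothetical, the hyphen and space separators are one common interlinear convention, and the forms and tags are placeholders rather than real Rukai data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Morpheme:
    form: str                       # IPA transcription of the morpheme
    gloss: str                      # grammatical-function tag or lexical gloss
    category: Optional[str] = None  # lexical or syntactic category, if tagged


@dataclass
class Word:
    morphemes: List[Morpheme]


@dataclass
class Sentence:
    words: List[Word]
    free_translation: str
    audio_file: Optional[str] = None  # e.g. name of the MP3 recording

    def interlinear(self) -> str:
        """Render form line, gloss line, and free translation."""
        forms = " ".join("-".join(m.form for m in w.morphemes) for w in self.words)
        glosses = " ".join("-".join(m.gloss for m in w.morphemes) for w in self.words)
        return f"{forms}\n{glosses}\n'{self.free_translation}'"


# Placeholder content only; not actual corpus data.
sentence = Sentence(
    words=[
        Word([Morpheme("prefix", "TAG1"), Morpheme("root1", "gloss1")]),
        Word([Morpheme("root2", "gloss2")]),
    ],
    free_translation="free translation",
    audio_file="sentence_001.mp3",
)
print(sentence.interlinear())
```

Keeping the gloss on each morpheme, rather than as a separate line of text, keeps the form and gloss lines aligned and allows searching by grammatical tag.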

In addition, a set of metadata covering general information about a text, such as text profile, fieldwork activity, and management statements, was also developed. This will facilitate data access and sharing with other similar resources in the future. Figure 3 shows the metadata record of a text.

Figure 3: A metadata set of a text

5.2 The Formosan Geographical Information System  
In the geographic information database, the language distribution search enables users to learn the geographical distribution of each language and dialect. Another search system, the comparative word search, allows users to see the distribution of cognates and non-cognates within the Formosan languages and to identify spatial features. In the future, we hope to add two functions to this database: (1) a system to observe the expansion or contraction of a particular linguistic community over the last hundred years; and (2) an audio recording mapping system.
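The comparative word search can be thought of as grouping the words for one meaning by cognate set and attaching each language's location. The sketch below shows that grouping in Python; the record layout, the function name, the language labels, the cognate-set identifiers and the coordinates are all placeholders invented for illustration and do not reflect the archive's actual data or implementation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class WordRecord:
    language: str
    meaning: str
    cognate_set: str   # identifier shared by words judged to be cognate
    latitude: float    # placeholder coordinates of the speech community
    longitude: float


def distribution(records, meaning):
    """Group the languages expressing `meaning` by cognate set, with locations."""
    groups = defaultdict(list)
    for r in records:
        if r.meaning == meaning:
            groups[r.cognate_set].append((r.language, r.latitude, r.longitude))
    return dict(groups)


# Placeholder records only; not real lexical or geographic data.
records = [
    WordRecord("Language A", "water", "set-1", 23.0, 120.5),
    WordRecord("Language B", "water", "set-1", 23.5, 121.0),
    WordRecord("Language C", "water", "set-2", 24.5, 121.5),
    WordRecord("Language A", "stone", "set-3", 23.0, 120.5),
]

result = distribution(records, "water")
for cognate_set, places in result.items():
    print(cognate_set, places)
```

Each group can then be plotted on a map with its own marker so that areas sharing a cognate stand out from those that do not.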

5.3 Four Bibliographic Databases  
The on-line reference search system gives users access to information on the Formosan languages in four areas: linguistic references, indigenous teaching references, indigenous literature, and music references. This information is regularly updated. It is hoped that complete and rich Formosan language bibliographic databases will be achieved to meet the needs of linguistic workers.

6.0 Conclusion  
Unlike artifacts and specimens, languages are non-physical, and this is the biggest challenge for language archives. Advances in technology make it possible for us to digitize ancient manuscripts, as well as oral or written records of an endangered language. However, preservation goes beyond digitization. In order to prevent the digitized data from becoming cold and lifeless digital antiques, we emphasize the reusability, sharability, and accessibility of the archives. This philosophy coincides with the vision and mission of OLAC. In describing the NDAP Language Archives project in Taiwan, we showed that the digitization of two archives as different as early Chinese documents and endangered Formosan languages can be done under the same project and with the same infrastructure. This is an important testimony to the open archives initiative vision. We hope that this work represents a small step in the direction of a world where linguistic and cultural diversity is accepted and shared by all.

  1. Bird, Steven, Gary Simons, and Chu-Ren Huang. (2001). "The Open Language Archives Community and Asian Language Resources". Paper presented at the 6th Natural Language Processing Pacific Rim Symposium Post-Conference Workshop, Tokyo, Japan. November 30, 2001.
  2. CDP. (2002). Manual for Hanzi Formation Database, (in Chinese), Chinese Document Processing Lab. http://www.sinica.edu.tw/~cdp/zip/hanzi/hzmanual.zip.
  3. Chen, Chao-jung and W. Lin. (2003). "Missing character problem and requirement: A experience from the bronze inscriptions of Shang and Zhou Dynasty". Paper presented at the conference on the Guidelines for Digital Archives Technologies. Taipei, Taiwan. April 30~May 1, 2003.
  4. CKIP. (1993). Analysis of Chinese Part of Speech (Zhung Wen Ci Lei Fen Xi). In Chinese. CKIP Technical Report no. 93-05. Taipei: Academia Sinica.
  5. Lin, Hui-chuan. (1999) Let's talk Mantauran, 1-6. Taipei: Crane Publishing.
  6. Hsieh, Ching-Chun, C. Chang and K. Huang. (1990). "On the Formalization of Glyph in the Chinese Language", ISO/IEC JTCI/SC18/WG8 and AFII Meeting, Kyoto, Feb, 1990.
  7. Huang, Chu-Ren (2003). "Semantic web, WordNet and Ontology: A talk on knowledge management on future's web". Information Management for Buddhist Libraries. 33: 6-21.
  8. --(2003). "Word Knowledge, World Knowledge, and Ontology: Towards a linguistic infrastructure for knowledge representation and knowledge engineering". Presented at Language and Knowledge Representation: Mini-workshop on functional approaches to language. Hsinchu: National Jiao-Tung University. February 26, 2003
  9. Chang, Ru-Yng and Chu-Ren Huang. (2002). "OLACMS: Comparisons and Applications in Chinese and Formosan Languages". Paper presented at the 19th COLING 2002 Post-Conference Workshop --The 3rd Workshop on Asian Language Resources and International Standardization. Center of Academia Activities, Academia Sinica. Taipei. Taiwan. August 31, 2002.
  10. Simons, Gary. (2002). "SIL Three-letter Codes for Identifying Languages: Migrating from in-house standard to community standard". Paper presented at the International Workshop on Resources and Tools in Field Linguistics (LREC 2002), Las Palmas, Canary Islands. May 26-27, 2002.
  11. Yu, Ching-hua. (2002). "Discussion on the digitization of the Formosan Language Archive: Building up of the architecture of the archive". Paper presented at the First workshop on the Digital Library Project, Taipei, July 25-26.
  12. Zeitoun, Elizabeth, and Hui-chuan Lin. (2001). We should not forget the stories of the Mantauran, vol.2: Traditional folktales. MS.
  13. --. (2003). "We should not forget the stories of the Mantauran, vol.I: Memories of the past". Language and Linguistics Monograph Series, No. A4. Taipei: Institute of Linguistics (Preparatory Office), Academia Sinica.
  14. --, Ching-hua Yu, and Cui-xia Weng. (2003). "The Formosan Language Archive: Development of a Multimedia Tool to Salvage the Languages and Oral Traditions of the Indigenous Tribes of Taiwan". Oceanic Linguistics, 42.1

Referential Websites  
  1. The Language Archives Project, 2003. Institute of Linguistics, Academia Sinica. http://languagearchives.sinica.edu.tw/
  2. The National Digital Archives Program, 2003. Taiwan. http://www.ndap.org.tw/
  3. The Early-Modern Chinese Lexicon Project, 2003. http://www.sinica.edu.tw/Early_Mandarin/
  4. Lexicon of Pre-Qin Bronze Inscriptions and Bamboo Scripts (LBB), 2003. http://www.sinica.edu.tw/~lbb/
  5. The Modern Chinese Corpus and Treebank Project, 2003. http://www.sinica.edu.tw/SinicaCorpus/
  6. OLAC http://www.language-archives.org
  7. OAI http://www.openarchives.org
  8. OLAC Metadata (version 1.0), http://www.language-archives.org/OLAC/metadata.html
  9. Western Calendar and Chinese Calendar Conversion Table of Academia Sinica Computing Centre. http://www.sinica.edu.tw/~tdbproj/sinocal/luso.html
  10. The Chinese Document Processing Lab, 2003. http://www.sinica.edu.tw/~cdp/
  11. The Formosan Language Digital Archives: http://www.ling.sinica.edu.tw/Formosan/.
  12. SIL http://www.sil.org/
  13. Ethnologue, Language of the World, http://www.ethnologue.com
  14. Dublin Core http://dublincore.org/
