E-MELD: Electronic Metastructure for Endangered Languages Data

1.     INTRODUCTION

 

Language data is central to the research of a large social sciences community, including not only linguists but also anthropologists, archaeologists, historians, sociologists, and political scientists interested in the culture of indigenous peoples. Members of this research community currently face two urgent situations: the number of languages in the world is rapidly diminishing, while the number of initiatives to create digital archives of language data is rapidly multiplying as a result of the increasing availability and sophistication of web technology. The latter might seem to be an unalloyed good in the face of the former, but there are two ways things may go wrong without adequate collaboration among archivists, linguists, and language engineers. First, a common standard for the digitization of linguistic data may never be agreed upon, and the resulting variation in archiving practices and language representation would seriously inhibit data access, searching, and scientific investigation. Second, standards may be implemented without guidance from the people who best know the range of structural possibilities in human language: descriptive linguists who have documented hundreds of little-known languages. Guidelines designed on the basis of well-known western languages will not be adequate to the urgent task of archiving as much linguistic data as possible in the face of widespread language attrition and loss.

 

If digital archives of language data and documentation are to offer the widest possible access and to provide information in a maximally useful form, consensus must be reached about certain aspects of archive infrastructure. As the largest linguistic organization in the world and the discipline's central electronic publication, The LINGUIST List <http://www.linguistlist.org> is organizing a collaborative project with a dual objective: (1) to preserve endangered language (EL) data and documentation and (2) to aid in the development of infrastructure for linguistic archives. One outcome of the project will be a LINGUIST List digital archive housing data from 10 endangered languages (ELs). But the focus on infrastructure will produce other, equally important results. In the first place, the LINGUIST archive will function not only as a repository but also as a "showroom of best practice." The archive will offer EL data marked up and catalogued according to community consensus about best practice; it will also disseminate reference material delineating best practice and software tools supporting it. A second outcome will be the establishment on the LINGUIST List site of a central metadata server for the discipline; this server will eventually organize information on all the language-related resources residing at distributed sites, not just information on EL data. And a third outcome, perhaps the most important, will be the involvement of a large segment of the linguistics community in the various enterprises underlying the archive and server. Capitalizing upon LINGUIST's high profile within the discipline, we will launch summer workshops and "digital institutes" to formulate and review recommendations of best practice and to train a substantial core of linguists and language archivists in their implementation. At the same time, we will use the many avenues of electronic communication open to us to publicize the results of these meetings and solicit input from linguists across the world.

 

Although the data collection efforts will focus initially on endangered languages, the metadata server, the recommendations for best practice, and the distribution of supporting software will have a significant impact on all empirical research in linguistics. Thus the project will add value to all the other language-related projects currently planned or underway.

 

2.     THE PROBLEM

2.1     Language Endangerment

 

Grimes (1996) estimates that there are 6703 languages spoken in the world today. LaPolla (1998) has run statistical analyses based upon census and population estimate figures that show that an alarming number of these, perhaps as many as 50%, are in real danger of extinction. Fifty-two per cent of the world's languages are spoken regularly by fewer than 10,000 people, 28% by fewer than 1,000, and 10% by fewer than 100 (LaPolla 1998). The editors of Ethnologue <http://www.sil.org/ethnologue> estimate that 52 languages have only one native speaker left, and 426 are "nearly extinct." By contrast, 49% of the world's population speaks one of 10 major languages (Mandarin, English, Spanish, Hindi, Portuguese, Bengali, Russian, Japanese, French, German) as their mother tongue. Indeed, the impending disappearance of so many languages has become so pressing a problem that it has generated notice in the popular media and considerable public concern (see, for example, Harper's Magazine, Aug. 2000; Newsweek International, June 19, 2000; The Independent, July 16, 1999; Daily News, Aug. 19, 1999; Florida Today, May 16, 1999).

 

As scientists, we have a twofold reason to be concerned about this trend in rapid language loss. First, and most importantly, the death of a language or dialect represents a significant loss in knowledge and culture. Language serves as a primary means of cross-generational cultural transmission, and the death of a language may represent a serious impediment to the survival of the community - comprising, as it often does, the loss of the community's traditional poetry, songs, stories, proverbs, laments, and religious rites. Second, the death of a language or dialect represents a serious academic loss (Hale 1996, Woodbury 1993). Studies of linguistic diversity and cross-linguistic comparisons drive much of linguistic theory. Such studies also provide valuable information about population movements, contacts, and genetic relationships; thus they figure as well in research in anthropology, archaeology, history, and ethno-biology. Many (if not most) of the endangered languages have not been well studied or documented. When such a language disappears, then, there are two losses: the loss of valuable scientific data, and the loss of the knowledge and worldview it represents.

 

2.2     Digitization efforts

 

The topic of language endangerment has thus become important to linguists of all theoretical backgrounds and areas of specialization, to professionals in related disciplines, and to concerned citizens who value cultural diversity. It is the focus of endangered language organizations across the world, e.g., the Linguistic Society of America Committee on Endangered Languages and their Preservation (CELP), The Foundation for Endangered Languages (FEL), Terralingua (TL), the International Clearing House for Endangered Languages (ICHEL), and The Endangered Language Fund (ELF). For this reason, a number of digital archives of EL data are currently being planned or developed. Among the most prominent within the United States are:

 

Significant projects outside the US include:

Appended is a list of 47 web sites dedicated to ELs, not including the sites listed above or sites which are focused on culture only (see Supplementary Documents, Part 2). Not all of these sites plan to establish a fully developed archive, but they all host collections of texts, grammars, and teaching materials. Thus they are potential sources of EL data and metadata.  

The establishment of multiple archives is to be welcomed, since the magnitude of the task requires distributed effort. No one institution can archive all the important data on all the currently endangered languages - certainly not within the time limits imposed by impending language attrition and by the ongoing deterioration of the existing documentation. Paper, audiotapes, videotapes, and computer diskettes are all prone to degradation and destruction. Moreover, most field notes and grammars currently reside on individual computers, vulnerable to disk crashes as well as file corruption. Some older notes and grammars still exist only in the form of notebooks and file cards. Because language data is difficult to publish commercially, it may be stored negligently or even abandoned once the research based on it has been completed. Even when such material is deposited in conventional libraries, data preservation is not certain, because many libraries cannot offer optimum storage conditions. [1]

Digital archiving at distributed sites offers the best hope for preserving this valuable linguistic material. But developing all the infrastructure necessary for a digital archive of language data (including delivery mechanism, formatting guidelines, and supporting software) is a huge task that is beyond the capacity of any single institution to accomplish on its own (Simons, 2000b: 1). And once multiple institutions have set up online archives, resorting to different strategies for designing infrastructure, it will be more difficult to implement any general solution. Without such a common infrastructure, the individual linguist will find it very difficult to identify all the resources pertinent to a given language. To posit an extreme case: the language in question may be classified, or even named, differently in different archives (e.g., Waikurean vs. Guaicuruan, Lappish vs. Sami). The language data may be marked up using different sets of structural tags (e.g., possessive vs. genitive). The texts may have different organizations (e.g., chronological organization vs. frequency organization of the meanings in a dictionary entry). And the files may have different formats because they have been created with incompatible software tools. In this situation, even a linguist with access to resources might not be able to compare them well enough to make reliable linguistic judgments. But, what is perhaps even more disturbing, locating all the relevant material in the first place will be a formidable task. It is unlikely that all the sound and video recordings, texts, grammars, dictionaries, and cultural information pertinent to a given language will ever reside on a single site. And if various archives develop different ways of describing and indexing their resources, no central meta-index can easily be developed. The amount of data will defeat a human librarian, and the different formats will defeat a machine.
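
The naming problem described above can be illustrated with a minimal sketch. It is not a description of any existing archive or of the E-MELD design: the alias table, the archive labels, and the function names are all hypothetical, and the choice of which name in each pair counts as "canonical" is arbitrary. The point is only that a central index can group records from different archives once variant language names are mapped to a shared reference name.

```python
from collections import defaultdict

# Hypothetical alias table: variant name (lowercased) -> shared reference name.
# The pairs are the examples given in the text; which member of each pair is
# treated as the reference name is an arbitrary choice for illustration.
NAME_ALIASES = {
    "lappish": "Sami",
    "sami": "Sami",
    "waikurean": "Guaicuruan",
    "guaicuruan": "Guaicuruan",
}


def canonical_name(name):
    """Return the shared reference name, or the cleaned input if unknown."""
    cleaned = name.strip()
    return NAME_ALIASES.get(cleaned.lower(), cleaned)


def build_index(records):
    """Group archive labels under the reference name of each language."""
    index = defaultdict(list)
    for record in records:
        index[canonical_name(record["language"])].append(record["archive"])
    return dict(index)


# Hypothetical records, as three archives might describe their holdings.
records = [
    {"archive": "archive_a", "language": "Lappish"},
    {"archive": "archive_b", "language": "Sami"},
    {"archive": "archive_c", "language": "Waikurean"},
]
index = build_index(records)
```

Without the alias table, the first two records would be indexed under two unrelated names and a search for one would miss the other. A shared reference for language names removes that particular obstacle, although it does nothing for the differences in markup and file format.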

 

2.3     The Scope of the Problem

 

It should be emphasized that all of the problems enumerated above arise in the context of archiving any electronic language data, not EL data alone. It is the impending disappearance of so many endangered languages that leads us to focus first on this aspect of the more general language data problem. However, this focus has a distinct - although paradoxical - benefit: the challenging nature of the data set. Many, if not most, ELs have structures which diverge so widely from each other and from those of western European languages that metadata and markup guidelines adequate for these languages will almost certainly be adequate for other language data as well. Thus an attempt to define standards for the digitization of ELs is, in fact, also an attempt to define standards for the digitization of languages in general. And all the facilities developed to provide access to ELs can - and will - be extended to provide access to other linguistic data as well.

3.     TOWARD A SOLUTION: E-MELD

 

Any attempt to address the language archiving problem must have at least three components:
1) Community Involvement. All the different stakeholders in the EL archiving enterprise must be kept fully informed and continually consulted. In particular, we must foster communication between computational linguists and field linguists, since a computational solution developed without the input of descriptive linguists will never become widely accepted. To the extent possible, we must also involve indigenous communities: native speakers of ELs should take part in markup formulation and community leaders in archive design.
2) Flexibility. Any proposed solution must (a) have the capacity to handle legacy data in various formats and (b) allow for some continuing variation in individual practice. Not only will different languages and theories always call for different analytical categories, but different research questions will always call for different types of data manipulation and display.
3) Collaboration. Organizations must pool their resources in light of: (a) the volume of work and the range of expertise needed for a unified solution and (b) the danger that partial, uncoordinated "solutions" will only exacerbate the problem (see 2.2 above).

The E-MELD project has been structured with these three requirements in mind. It implements part of a distributed solution proposed in Simons (2000b), which recommends a coordination of effort among the Linguistic Data Consortium (LDC), the Summer Institute of Linguistics (SIL), and The LINGUIST List. The Linguistic Data Consortium will function as a central repository of standards and software (which may be developed elsewhere); the SIL Ethnologue will constitute the standard reference for language classification; and The LINGUIST List will serve as a central repository of metadata, as well as an institutionalized conduit of information between language engineering projects and the linguistics community. [2]

LINGUIST has already taken several steps toward assuming its suggested roles (see Project Preliminaries, 3.7.1 below). The E-MELD project, which involves The Endangered Language Fund and The University of Arizona, as well as the three institutions named above, represents significant, unified progress toward this collaborative goal.

3.1     Project Components

 

In its general outlines the E-MELD project involves: