E-MELD: Electronic Metastructure for Endangered Languages Data
Language data is central to the research of a large social sciences community, including not only linguists, but also anthropologists, archeologists, historians, sociologists, and political scientists interested in the culture of indigenous peoples. Members of this research community are currently faced with 2 urgent situations: the number of languages in the world is rapidly diminishing while the number of initiatives to create digital archives of language data is rapidly multiplying as a result of the increasing availability and sophistication of web technology. The latter might seem to be an unalloyed good in the face of the former, but there are 2 ways things may go wrong without adequate collaboration among archivists, linguists, and language engineers. First, a common standard for the digitization of linguistic data may never be agreed upon, and the resulting variation in archiving practices and language representation would seriously inhibit data access, searching, and scientific investigation. Second, standards may be implemented without guidance from the people who best know the range of structural possibilities in human language - descriptive linguists who have documented hundreds of little-known languages. Guidelines which are designed on the basis of well-known western languages will not be adequate to the urgent task of archiving as much linguistic data as possible in the face of widespread language attrition and loss.
If digital archives of language data and documentation are to offer the widest possible access and to provide information in a maximally useful form, consensus must be reached about certain aspects of archive infrastructure. As the largest linguistic organization in the world and the discipline's central electronic publication, The LINGUIST List <http://www.linguistlist.org> is organizing a collaborative project with a dual objective: (1) to preserve endangered language (EL) data and documentation and (2) to aid in the development of infrastructure for linguistic archives. One outcome of the project will be a LINGUIST List digital archive housing data from 10 endangered languages. But the focus on infrastructure will produce other, equally important results. In the first place, The LINGUIST archive will function, not only as a repository, but also as a "showroom of best practice." The archive will offer EL data marked up and catalogued according to community consensus about best practice; it will also disseminate reference material delineating best practice and software tools supporting it. A second outcome will be the establishment on the LINGUIST List site of a central metadata server for the discipline; this server will eventually organize information on all the language-related resources residing at distributed sites, not just information on EL data. And a third outcome - perhaps the most important - will be the involvement of a large segment of the linguistics community in the various enterprises underlying the archive and server. Capitalizing upon LINGUIST's high profile within the discipline, we will launch summer workshops and "digital institutes" to formulate and review recommendations of best practice and to train a substantial core of linguists and language archivists in their implementation.
At the same time we will use the many avenues of electronic communication open to us to publicize the results of these meetings and solicit input from linguists across the world.
Although the data collection efforts will focus initially on endangered languages, the metadata server, the recommendations for best practice, and the distribution of supporting software will have a significant impact on all empirical research in linguistics. Thus the project will add value to all the other language-related projects currently planned or underway.
Grimes (1996) estimates that there are 6703 languages spoken in the world today. LaPolla (1998) has run statistical analyses based upon census and population estimate figures that show an alarming number of these, perhaps as many as 50%, are in real danger of extinction. Fifty-two per cent of the world's languages are spoken regularly by fewer than 10,000 people, 28% are spoken by fewer than 1,000 and 10% by fewer than 100 (LaPolla 1998). The editors of Ethnologue <http://www.sil.org/ethnologue> estimate that 52 languages have only 1 native speaker left, and 426 are "nearly extinct." By contrast, 49% of the world's population speaks one of 10 major languages (Mandarin, English, Spanish, Hindi, Portuguese, Bengali, Russian, Japanese, French, German) as their mother tongue. Indeed, the impending disappearance of so many languages has become so pressing a problem that it has generated notice in the popular media and considerable public concern (see, for example, Harper's Magazine, Aug., 2000; Newsweek International, June 19, 2000; The Independent, July 16, 1999; Daily News, Aug. 19, 1999; Florida Today, May 16, 1999).
As scientists, we have a twofold reason to be concerned about this trend in rapid language loss. First, and most importantly, the death of a language or dialect represents a significant loss in knowledge and culture. Language serves as a primary means of cross-generational cultural transmission. And the death of a language may represent a serious impediment to the survival of the community - comprising, as it often does, the loss of the community's traditional poetry, songs, stories, proverbs, laments, and religious rites. Second, the death of a language or dialect represents a serious academic loss (Hale 1996, Woodbury 1993). Studies of linguistic diversity and cross-linguistic comparisons drive much of linguistic theory. Such studies also provide valuable information about population movements, contacts, and genetic relationships; thus they figure as well in research in anthropology, archaeology, history, and ethno-biology. Many (if not most) of the endangered languages have not been well studied or documented. When such a language disappears, then, there are 2 losses: the loss of valuable scientific data, and the loss of the knowledge and worldview it represents.
The topic of language endangerment has thus become important to linguists of all theoretical backgrounds and areas of specialization, to professionals in related disciplines, and to concerned citizens who value cultural diversity. It is the focus of endangered language organizations across the world, e.g., the Linguistic Society of America Committee on Endangered Languages and their Preservation (CELP), The Foundation for Endangered Languages (FEL), Terralingua (TL), the International Clearing House for Endangered Languages (ICHEL), and The Endangered Language Fund (ELF). For this reason, a number of digital archives of EL data are currently being planned or developed. Among the most prominent within the United States are:
Significant projects outside the US include:
The establishment of multiple archives is to be welcomed, since the magnitude of the task requires distributed effort. No one institution can archive all the important data on all the currently endangered languages - certainly not within the time limits imposed by impending language attrition and by the ongoing deterioration of the existing documentation. Paper, audiotapes, videotapes, and computer diskettes are all prone to degradation and destruction. Moreover, most field notes and grammars currently reside on individual computers, vulnerable to disk crashes as well as file corruption. Some older notes and grammars still exist only in the form of notebooks and file cards. Because language data is difficult to publish commercially, it may be stored negligently or even abandoned once the research based on it has been completed. Even when such material is deposited in conventional libraries, data preservation is not certain, because many libraries cannot offer optimum storage conditions. [1]
Digital archiving at distributed sites offers the best hope for preserving this valuable linguistic material. But developing all the infrastructure necessary for a digital archive of language data (including delivery mechanism, formatting guidelines, and supporting software) is a huge task that is beyond the capacity of any single institution to accomplish on its own (Simons, 2000b: 1). And once multiple institutions have set up online archives, resorting to different strategies for designing infrastructure, it will be more difficult to implement any general solution. Without such a common infrastructure, the individual linguist will find it very difficult to identify all the resources pertinent to a given language. To posit an extreme case: the language in question may be classified, or even named, differently in different archives (e.g., Waikurean vs. Guaicuruan, Lappish vs. Sami). The language data may be marked up using different sets of structural tags (e.g., possessive vs. genitive). The texts may have different organizations (e.g., chronological organization vs. frequency organization of the meanings in a dictionary entry). And the files may have different formats because they have been created with incompatible software tools. In this situation, even a linguist with access to resources might not be able to compare them well enough to make reliable linguistic judgments. But--what is perhaps even more disturbing--locating all the relevant material in the first place will be a formidable task. It is unlikely that all the sound and video recordings, texts, grammars, dictionaries, and cultural information pertinent to a given language will ever reside on a single site. And if various archives develop different ways of describing and indexing their resources, no central meta-index can easily be developed. The amount of data will defeat a human librarian, and the different formats will defeat a machine.
1) Community Involvement. All the different stakeholders in the EL archiving enterprise must be kept fully informed and continually consulted. In particular, we must foster communication between computational linguists and field linguists, since a computational solution developed without the input of descriptive linguists will never become widely accepted. To the extent possible, we must also involve indigenous communities: native speakers of ELs should take part in markup formulation and community leaders in archive design.
2) Flexibility. Any proposed solution must (a) have the capacity to handle legacy data in various formats and (b) allow for some continuing variation in individual practice. Not only will different languages and theories always call for different analytical categories, but different research questions will always call for different types of data manipulation and display.
3) Collaboration. Organizations must pool their resources in light of: (a) the volume of work and the range of expertise needed for a unified solution and (b) the danger that partial, uncoordinated "solutions" will only exacerbate the problem (see 2.2 above).
The E-MELD project has been structured with these 3 requirements in mind. It implements part of a distributed solution proposed in Simons (2000b), which recommends a coordination of effort among the Linguistic Data Consortium (LDC), the Summer Institute of Linguistics (SIL), and The LINGUIST List: The Linguistic Data Consortium will function as a central repository of standards and software (which may be developed elsewhere); the SIL Ethnologue will constitute the standard reference for language classification; and The LINGUIST List will serve as a central repository of metadata, as well as an institutionalized conduit of information between language engineering projects and the linguistics community. [2]
LINGUIST has already taken several steps toward assuming its suggested roles (see Project Preliminaries, 3.7.1 below). The E-MELD project, which involves The Endangered Languages Fund and The University of Arizona, as well as the 3 institutions named above, represents significant, unified progress toward this collaborative goal.
Metadata - or structured data about data - can be as simple as a keyword in a META field within an HTML document. Even the simplest kind of metadata can be useful as resource description, once the document is retrieved. But resource discovery requires the use of a standardized format. In a context as vast and rapidly expanding as the modern-day Internet, data is only valuable if it is findable, and if its relevance is interpretable through computational means. Thus one of the most important parts of the E-MELD project is the initiative to collect metadata on language resources at a central site. Though we will focus initially on EL resources, the facilities created will be extended as soon as possible to catalogue linguistics-related resources of all types. Such a catalogue will not only allow extant material to be identified and retrieved; but it will also enable distributed data to be pieced together. Given a markup standard and a metadata server, it will not matter if a dictionary of a language appears at one site and a grammar of the same language appears at another. They can be linked through their metadata, and used in conjunction with one another. But in order to establish such a central index, it will be necessary to reach consensus about best practice in the creation of metadata for language resources, as well as to collect existing metadata and convert it into this format, and to institute user-friendly systems for input and query of the information.
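The linking described above can be illustrated with a minimal Python sketch. It assumes that each site exposes a metadata record with a subject (language name), a resource type, and a URL. The record layout, the alias table, the choice of which name counts as canonical, and the example.* URLs are all hypothetical and serve only to show how a dictionary at one site and a grammar at another could be joined through their metadata.

```python
from collections import defaultdict

# Hypothetical alias table. Which name is treated as canonical is an
# assumption made for this sketch, not a recommendation of the project.
CANONICAL_NAME = {
    "lappish": "Sami",
    "sami": "Sami",
    "waikurean": "Guaicuruan",
    "guaicuruan": "Guaicuruan",
}


def canonical(name):
    """Return the canonical language name, or the cleaned input if unknown."""
    cleaned = name.strip()
    return CANONICAL_NAME.get(cleaned.lower(), cleaned)


def link_by_language(records):
    """Group metadata records from different sites under one language name."""
    linked = defaultdict(list)
    for record in records:
        linked[canonical(record["subject"])].append(record)
    return dict(linked)


# Illustrative records; the URLs are placeholders, not real archives.
records = [
    {"subject": "Lappish", "type": "dictionary", "url": "http://archive-a.example/dict"},
    {"subject": "Sami", "type": "grammar", "url": "http://archive-b.example/grammar"},
    {"subject": "Mocovi", "type": "grammar", "url": "http://archive-c.example/mocovi"},
]

linked = link_by_language(records)
for language, items in sorted(linked.items()):
    print(language, [item["type"] for item in items])
```

In this sketch the two records filed under different names for the same language end up in one group, which is the effect a shared naming standard is meant to guarantee without a hand-built alias table.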
One possible starting point is the standard defined by the Dublin Core Metadata Initiative (http://www.purl.org/dc/). The Dublin Core Element Set is limited to 15 elements standardized to begin with the prefix "DC," as in "DC.Creator." Although some difficulties have been identified with large-scale implementations, the DC metadata standard is gaining wide, cross-disciplinary acceptance, in part because of its simplicity as compared to the full MARC standard. Resources which bear metadata in DC format are interpretable and retrievable by many existing search tools (e.g., SWISH-E, WAIS-2.0, GLIMPSE, HARVEST, ISEARCH) and others which are currently being developed (e.g., BC, described at http://www.mpi.nl/world/tg/lapp/lapp.html). However, any existing metadata standard will need to be augmented by recommendations of best practice specific to language resources. Suppose, for example, that we wish to use the DC Element Set to describe an online grammar of Mocovi written in English. Part of a plausible resource description might be the HTML header lines given as (1) below:
(1) <meta name="DC.Subject" content="Mocovi">
    <meta name="DC.Type" content="grammar">
    <meta name="DC.Format" content="text/html">
    <meta name="DC.Language" content="en">
However, for the DC.Type, DC.Description, and DC.Format elements, "recommended best practice is to select a value from a controlled vocabulary or formal classification scheme" (Dublin Core Metadata Element Set, Version 1.1; Miller 1999). And no vocabulary or classification scheme appropriate for linguistics yet has general acceptance. The list of suggested DC types, for example, includes collection, dataset, event, image, interactive resource, model, party, physical object, place, service, software, sound, and text (Guenther 1999). It does not include grammar. Hence, the value 'grammar' for DC.Type in (1) has little general usefulness. It should perhaps be replaced by 'text', with 'grammar' provided as a value for DC.Description. But 'grammar' is not part of a recognized classification scheme for DC.Description either. [6] Lacking such discipline-specific controlled vocabulary - and lacking agreement even about the elements that should be included in linguistic metadata - language researchers will have increasing difficulty finding electronic resources. In developing and publicizing lists of such vocabulary, we intend to collaborate closely with the Linguistic Data Consortium and other groups, e.g., ISLE, which are also addressing the question of metadata format for language data. For instance, we intend to support the metadata initiative at the upcoming (Dec., 2000) Linguistic Exploration Workshop at the LDC in 2 ways: (a) by submitting examples of metadata collected from the advisors to our project and (b) by gathering feedback on the Exploration Workshop outcomes at our first workshop with field linguists and archivists, scheduled for June, 2001 (see Project Preliminaries, 3.7.1 below).
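The adjustment suggested above for example (1) can be sketched in Python as follows. The type list is the one quoted from Guenther (1999). The function names are hypothetical, and the fallback rule (any unlisted type becomes 'text', with the original value moved to DC.Description) is an assumption that generalizes the single 'grammar' case discussed in the text; it is not part of the Dublin Core specification.

```python
from html import escape

# DC types as quoted in the text from Guenther (1999).
DC_TYPES = {
    "collection", "dataset", "event", "image", "interactive resource",
    "model", "party", "physical object", "place", "service", "software",
    "sound", "text",
}


def normalize_type(value):
    """Return (dc_type, description) for a proposed DC.Type value."""
    if value.lower() in DC_TYPES:
        return value.lower(), None
    # Assumption for this sketch: an unlisted type is a textual resource,
    # and the original value is kept as a free-text description.
    return "text", value


def meta_tags(subject, dc_type, dc_format, language):
    """Build HTML meta lines in the style of example (1)."""
    dc_type, description = normalize_type(dc_type)
    fields = [("DC.Subject", subject), ("DC.Type", dc_type)]
    if description is not None:
        fields.append(("DC.Description", description))
    fields += [("DC.Format", dc_format), ("DC.Language", language)]
    return [
        '<meta name="{}" content="{}">'.format(name, escape(content, quote=True))
        for name, content in fields
    ]


tags = meta_tags("Mocovi", "grammar", "text/html", "en")
print("\n".join(tags))
```

The sketch makes the gap visible: the DC.Type value can be checked against a list, but the DC.Description value cannot, because no agreed vocabulary for it exists yet.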
Typological information may include generalizing statements about (a) the set of types into which a language may fall, e.g., Subject-Verb-Object ordering, as opposed to Verb-Subject-Object and (b) the classification of a particular language according to these types. Although it is not metadata in the sense used above - i.e., it is not data about language resources - it may be construed as a kind of metadata about the languages themselves. And it is data of such potential usefulness to linguists that it is worth extending our concept of the metadata server to collect and provide it. At present there is no way for a linguist to find out what languages of the world are SVO or VSO. The Ethnologue does not collect such information; and though typological databases do exist in the hands of individual linguists, these are not generally accessible. For that reason, we intend to cooperate with related projects such as WALS (World Atlas of Language Structures: http://www.eva.mpg.de/~haspelmt/atlas.html) to formulate a web-based typological questionnaire which is brief enough to be practicable and yet contains questions about the information deemed most significant by a committee representing as much theoretical diversity as possible. The Summer Institute of Linguistics has agreed to ask their numerous field linguists to answer this questionnaire with regard to the languages they have studied. The online facility will, of course, also record and display variant answers from linguists who disagree with the original descriptions; but the initial collection procedure will immediately provide a wealth of typological data relevant to pressing questions of EL research.
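A minimal Python sketch of how such questionnaire answers might be stored and queried is given below. The class name, method names, and feature label are hypothetical. "Language X" and its two conflicting answers are invented placeholders that illustrate variant answers; English (SVO) and Welsh (VSO) are included only as familiar examples.

```python
from collections import defaultdict


class TypologyStore:
    """Keeps every answer to a questionnaire item, including variant answers."""

    def __init__(self):
        # language -> feature -> list of (value, contributor), in order received
        self._answers = defaultdict(lambda: defaultdict(list))

    def record(self, language, feature, value, contributor):
        self._answers[language][feature].append((value, contributor))

    def answers(self, language, feature):
        """All recorded answers for one language and feature."""
        return list(self._answers.get(language, {}).get(feature, []))

    def languages_with(self, feature, value):
        """Languages for which at least one contributor gave this value."""
        return sorted(
            language
            for language, features in self._answers.items()
            if any(v == value for v, _ in features.get(feature, []))
        )


store = TypologyStore()
store.record("English", "word_order", "SVO", "contributor 1")
store.record("Welsh", "word_order", "VSO", "contributor 1")
# Invented disagreement about a placeholder language.
store.record("Language X", "word_order", "SVO", "contributor 2")
store.record("Language X", "word_order", "VSO", "contributor 3")

print(store.languages_with("word_order", "VSO"))
print(store.answers("Language X", "word_order"))
```

Storing answers as a list rather than a single value is the design choice that lets the facility display disagreement instead of overwriting the original description.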
The E-MELD project will create user-friendly web interfaces for metadata input; and the PIs will contact cooperating archivists to request their metadata. In addition, LINGUIST intends to implement an innovative procedure to identify other sites on the Internet which store language data but have not yet participated in the project. This will involve using a spider to index other linguistics-related sites and configuring search software to search the index using a keyword list. In this way potential sources of metadata may be identified. The site owners will then be approached and invited to contribute to the database.
The spidering procedure will exploit LINGUIST's comprehensive collection of relevant URLs. Almost all links related to language and linguistics are announced on The LINGUIST List, and we have had a "URL-grabber" operative on the site since 1994. This custom software copies each link that passes through LINGUIST to a file. Now that LINGUIST hosts 70 other language-related lists (see Track Record, 4.2 below), our ability to collect linguistic URLs has been greatly enhanced. Our collection of links is a natural domain which a spider can index, as a first step to finding additional language data.
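The internals of the URL-grabber are not described here, so the following Python sketch only illustrates the general idea of copying each link found in a posted message to a list. The regular expression, the function name, and the trailing-punctuation rule are assumptions, not a description of the actual LINGUIST software.

```python
import re

# Assumed pattern: an http(s) scheme followed by characters that are not
# whitespace or angle brackets.
URL_PATTERN = re.compile(r"https?://[^\s<>]+")


def grab_urls(message):
    """Return the distinct URLs in a message, in order of first appearance."""
    urls = []
    for match in URL_PATTERN.findall(message):
        url = match.rstrip(".,;:)")  # drop sentence punctuation stuck to the link
        if url and url not in urls:
            urls.append(url)
    return urls


message = (
    "See the Ethnologue <http://www.sil.org/ethnologue> and "
    "http://www.linguistlist.org. Archives: http://www.linguistlist.org"
)
collected = grab_urls(message)
print(collected)
```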
Furthermore, the index itself will be an extremely useful linguistic search tool - of a kind which, to our knowledge, no other discipline has. Since it will index only linguistics-related sites, searching this index will not return the unwieldy amount of irrelevant information that linguists inadvertently retrieve from a general web search engine. It will be a linguistics-specific Internet search facility; and it will be made freely available on the LINGUIST List site.
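The index-and-search step can be sketched in Python as a small inverted index. The sketch assumes that the spider has already fetched the page text, so no network access is shown. The page contents, the example.* URLs, the keyword list, and the ranking rule (number of matching keywords) are all hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, keywords):
    """Rank URLs by how many of the single-word keywords they contain."""
    hits = defaultdict(int)
    for keyword in keywords:
        for url in index.get(keyword.lower(), ()):
            hits[url] += 1
    return sorted(hits, key=lambda url: (-hits[url], url))


# Placeholder pages standing in for text fetched by the spider.
pages = {
    "http://site-a.example/": "Field notes and a word list for an endangered language",
    "http://site-b.example/": "Conference announcements and job postings",
    "http://site-c.example/": "A grammar and dictionary with glossed texts",
}
index = build_index(pages)
results = search(index, ["grammar", "dictionary", "word", "list"])
print(results)
```

Pages that match none of the keywords are left out of the results, which is how candidate sources of language data would be separated from the rest of the collected links.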
Markup is systematic annotation designed to reveal a text's typographical and informational structure. Linguistic markup - a particularly challenging sub-variety - might be broadly described as annotation representing: (a) the grammatical structure of text couched in the focus language and (b) the structure of documents presenting a linguistic description or analysis of such text. Linguistic markup is required in the digitization of such language documentation as paradigms, word lists, dictionary entries, and glossed text. And most language documentation invites both types of annotation (although, of course, a distinction may be maintained, e.g. in the use of parsing vs. stylesheet software).
Markup for interlinearized text, for example, must represent both the phonological, morphological and syntactic structure of the text and enough additional information to allow reconstruction of the conventionalized formatting which makes the information intelligible. This is exemplified in the fragment of a Mocovi text (Grondona 1998) given as (2) below:
(2) Glossed Mocovi text fragment
a. | ka/maq | yale | yowito/ | ka | lawo/ | ka | /na:ko/: |
b. | ka-/maq | yale | i+owir+o/ | ka | l+awo-r | ka |