Wallace Hooper, Douglas Parks, Damir Cavar & Paul Kroeber, Indiana University

Linguistic and Archival Databases at the American Indian Studies Research Institute: Experience, Future Directions, and Best Practice

The American Indian Studies Research Institute documents North American languages based on field work and analysis and publishes dictionaries, grammars, and historic field data. We have transcribed, digitized, and organized masses of information from historic sources from the nineteenth and twentieth centuries and from our own projects from the late 1960s up to the present.

AISRI began to use databases and computational technologies to underpin those efforts in 1987 and has continued to use them through several major paradigm changes in database and computational theory and practice. We have tried to take advantage of new tools as they emerge to improve our field work and documentation projects and to manage the data that have accumulated. Our resources now include databases and data sets in several different types of database system that reflect the changing state of both linguistic and computational practice. These range from the latest XML-based models recommended as best practice by EMELD and OLAC, through web-based and desktop-based relational databases and legacy flat-file databases, to legacy collections of data sets residing in word-processor files.

The Indiana Dictionary Database Processor (IDD) is a fully normalized relational database system that we have discussed at previous meetings of SSILA, AAA, OLAC, and EMELD. We have had to add components to publish its data on the web and to export Unicode-encoded XML documents. Our interlinear text processor, the Annotated Text Processor (ATP), which we discussed at the last EMELD workshop, uses denormalized relational database strategies to support XML-compatible data management and manipulation. That project has required us to explore the intersection of relational database technology, XML theory and practice, and the realities of support for Unicode.

We have also just embarked on a new set of projects with computational linguists and other collaborators in Indiana's Department of Linguistics and Digital Libraries Initiative, projects that again look ahead to new database technologies and practices. In particular, we plan to explore the use of computational methods for the morphological analysis of North American languages, with special interest in polysynthetic languages in families like Caddoan, Salishan, and Algonkian. We are also heavily engaged in revising our archival, digitization, and dissemination practices, with special concern for the future use of our existing data collections. And we continue to follow EMELD's GOLD upper-level ontology project and to consider how we may adapt to its practical results.

Our paper will review our lengthy experience in light of the discussion and recommendations that have recently emerged from EMELD and attempt to draw some lessons. We will also sketch our plans to adapt to new computational paradigms and carry forward the data in our collections.