Daan Broeder, Adarsh Mehta, Eric Auer, Freddy Offenga & Peter Wittenburg,
Max Planck Institute for Psycholinguistics

LAT: MPI's Language Archiving Technology

At the MPI we have accumulated much experience in storing, managing and archiving language resources. About 10 years ago, advances in digitization and storage technology began to allow researchers to produce resources in such quantities that classical ways of storing and keeping track of data no longer sufficed. To allow efficient management and use of the stored data, we started to develop ways for users to browse and search the digital archive. To accommodate users who might feel uneasy about parting with their resources, we also decided to offer these same services for data held in a local repository on the user's notebook or desktop machine.

In looking for ways to check resources into a well-organized and well-managed domain and to allow users to retrieve them, we hoped that users would benefit in two ways: (1) the archivist takes care of long-term data survival; (2) retrieval becomes simpler for the researchers. We created a discovery framework built around the IMDI metadata set for language resources, combining different methods such as browsing, geographic browsing, structured searching and unstructured searching. Thanks to international contributions and projects, IMDI developed from a limited in-house standard into one that is widely agreed upon. At the moment IMDI, with all its flexibility and features (Editor, Browser, Search), is still at the core of the archive and its infrastructure.

Intensive use of the MPI archive made it necessary to develop additional management options, such as a full-fledged and efficient Access Management System that also supports delegation, since depositors want to keep control over who can access their data. In addition, we realized that the amount of data added to the archive was increasing at such a rate that the archive managers could no longer handle it manually: the number of errors increased, endangering the consistency of the archive. Consequently, LAMUS was added as a gate-keeper that takes care of consistency and format coherence. The web-based LAMUS tool allows users to act as their own archive managers, modifying existing data and uploading new resources into the archive. LAMUS removed the archive manager as a bottleneck on the deposit side, so that we can now offer archive services even to researchers with whom we have no direct collaboration.

We also realized that researchers no longer saw the archive simply as one big container in which to deposit data, but wanted to use the archived data directly via the web. The ANNEX and LEXUS tool concepts were the result. They allow users to access complex annotated media streams and multimedia lexica.

New requirements, resulting both from developments in archiving technology and from users' demand for more interactive archive use, will drive our technology forward to incorporate concepts such as unique persistent resource identifiers, a smart versioning system and ontology-supported cross-corpus access.

What started a decade ago as an opportunity to overcome a growing chaos of undiscoverable digital resources has grown into a web-accessible archive that offers its users many advantages.