Peter Wittenburg, Daan Broeder & Hennie Brugman, Max Planck Institute for Psycholinguistics

Databases for Linguistic Purposes

The concept "database" refers to a container that encodes and structures data in a coherent way and offers methods for accessing it. Each type of database is therefore associated with an underlying data model. To mention just a few: (1) data stored in files (which can be XML-structured, for example) that are organised in directory hierarchies, (2) ISAM databases for hierarchically structured data, (3) relational databases that are based on the Entity-Relationship model and (4) native XML databases that contain XML-structured data. Each data model comes with its own constraints, which make it more or less attractive for specific purposes.

Repositories of XML-structured files and relational databases are currently very popular in linguistics. At the MPI both models have been used for scientific and administrative purposes for many years. Relational database systems such as ORACLE, MySQL or Postgres draw their strength from the underlying E-R data model: it is well suited to all data that can be structured as a set of related tables with typed and constrained attributes. With SQL, an rDBMS offers a simple yet powerful "API" that is based on simple algebraic principles. Theoretical work has led to recommendations (normal forms) that guide proper structural design. All encodings and operations are optimised for operating on such data. A consequence of this type of representation is that the data is encapsulated by a shell and is not directly readable.
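As a minimal sketch of what "related tables with typed and constrained attributes" means in practice, the following Python fragment uses the standard sqlite3 module. The tables lemma and wordform, their columns and the sample words are hypothetical and chosen only for illustration; they do not reproduce any schema used at the MPI.

```python
import sqlite3

# Hypothetical two-table schema; all names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE lemma (
    id   INTEGER PRIMARY KEY,
    form TEXT NOT NULL UNIQUE,
    pos  TEXT NOT NULL CHECK (pos IN ('N', 'V', 'A'))
);
CREATE TABLE wordform (
    id       INTEGER PRIMARY KEY,
    lemma_id INTEGER NOT NULL REFERENCES lemma(id),
    form     TEXT NOT NULL
);
""")
conn.execute("INSERT INTO lemma (id, form, pos) VALUES (1, 'walk', 'V')")
conn.executemany(
    "INSERT INTO wordform (lemma_id, form) VALUES (?, ?)",
    [(1, "walks"), (1, "walked"), (1, "walking")],
)


def wordforms_of(lemma_form):
    """Return all word forms of a lemma via a join over the two tables."""
    rows = conn.execute(
        "SELECT w.form FROM wordform AS w "
        "JOIN lemma AS l ON l.id = w.lemma_id "
        "WHERE l.form = ? ORDER BY w.form",
        (lemma_form,),
    )
    return [row[0] for row in rows]


def rejects_invalid_pos():
    """Return True if the CHECK constraint refuses an unknown part of speech."""
    try:
        conn.execute("INSERT INTO lemma (form, pos) VALUES ('xyz', 'Q')")
    except sqlite3.IntegrityError:
        return True
    return False
```

The constraint is enforced by the database itself, so every tool that writes to the tables is held to the same rules. The price is that the data can only be reached through the database engine.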

This algebraic framework allows us to use rDBMS for many administrative tasks. For good reasons we also decided to base the development of one of the first large computer-based dictionaries (CELEX) on the relational model. It required a careful design leading to a data representation that is free of redundancy. This was seen as one of the basic decisions in developing these large lexica for German, Dutch and English by a team of about 10 specialists. The result was a set of about 40 related tables per lexicon covering all the necessary lexical structures and attributes. This very successful example also showed that the E-R model can be used for data that is basically hierarchically structured. However, CELEX did not include complex semantic information, which would lead, for example, to cross-references between lexical attributes that conflict with the underlying model. Also, hierarchical relationships between lexical entries can only be exploited by algorithms that go beyond SQL and include procedural elements.
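The last point can be sketched as follows, assuming a hypothetical table entry in which each lexical entry points to its parent entry. A single query returns only the direct children of an entry; collecting a whole subtree needs a procedural loop around the query. The table, its columns and the sample words are illustrative assumptions and are not taken from CELEX.

```python
import sqlite3

# Hypothetical parent/child table of lexical entries; not the CELEX schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entry ("
    " id INTEGER PRIMARY KEY,"
    " form TEXT NOT NULL,"
    " parent_id INTEGER REFERENCES entry(id))"
)
conn.executemany(
    "INSERT INTO entry (id, form, parent_id) VALUES (?, ?, ?)",
    [
        (1, "work", None),
        (2, "worker", 1),
        (3, "workshop", 1),
        (4, "co-worker", 2),
    ],
)


def descendants(entry_id):
    """Collect the forms of all entries below entry_id, breadth first."""
    found = []
    queue = [entry_id]
    while queue:
        current = queue.pop(0)
        # One flat query per level; the loop supplies the hierarchy.
        rows = conn.execute(
            "SELECT id, form FROM entry WHERE parent_id = ? ORDER BY id",
            (current,),
        ).fetchall()
        for child_id, form in rows:
            found.append(form)
            queue.append(child_id)
    return found
```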

The usage of relational databases was investigated for the speech error database as well. The data had been created in many different ways by several authors working independently. Eventually, a DTD was established to describe the structure of a unified XML file into which all the different sources could be merged. As a result of the unification, each entry has a large number of attributes, only a few of which are filled.

The structure of the data also included many small hierarchical extensions. XML was therefore well suited to model, validate and control the data. To set up a web-accessible database supporting fast searches, however, a relational database was created from the XML-based repository. The many small hierarchical extensions inevitably led to the construction of many tables. In this application the reason for choosing a relational database was simply to achieve high performance for the most frequent query types.
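The conversion from an XML repository to relational tables can be sketched roughly as below. The element and attribute names (error, target, utterance, repair) are hypothetical stand-ins, since the actual DTD of the speech error database is not given here. The sketch shows two effects described above: an optional attribute becomes a NULL value, and each small hierarchical extension needs a table of its own.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified records; the real DTD is not shown in the text.
XML = """
<errors>
  <error id="e1" language="nl">
    <target>boek</target>
    <utterance>doek</utterance>
    <repair type="self"><form>boek</form></repair>
  </error>
  <error id="e2">
    <target>kat</target>
    <utterance>tak</utterance>
  </error>
</errors>
"""


def load(xml_text):
    """Load the XML records into an in-memory relational database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE error (
        id TEXT PRIMARY KEY, language TEXT, target TEXT, utterance TEXT
    );
    -- The nested <repair> element needs a separate table.
    CREATE TABLE repair (
        error_id TEXT NOT NULL REFERENCES error(id), type TEXT, form TEXT
    );
    CREATE INDEX error_target ON error(target);
    """)
    root = ET.fromstring(xml_text)
    for err in root.findall("error"):
        # Missing attributes are returned as None and stored as NULL.
        conn.execute(
            "INSERT INTO error VALUES (?, ?, ?, ?)",
            (err.get("id"), err.get("language"),
             err.findtext("target"), err.findtext("utterance")),
        )
        for rep in err.findall("repair"):
            conn.execute(
                "INSERT INTO repair VALUES (?, ?, ?)",
                (err.get("id"), rep.get("type"), rep.findtext("form")),
            )
    conn.commit()
    return conn
```

In this arrangement the XML file remains the primary copy, and the tables are a derived search index that can be rebuilt at any time.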

For two very important pillars of our recent work, however, we did not choose a relational database as the primary container: the nucleus of the IMDI metadata infrastructure consists of XML-structured files, and the same is true for the annotation files created by the ELAN annotation tool. The considerations that led us to this decision are not related to the data model. The primary concern was that metadata records should be open, human-readable and distributed. With repositories of XML files it is very simple to create a distributed and interlinked metadata domain in which it is also easy to create, for example, parallel link structures (personalised metadata domains). This also permits tools that directly read and write the annotation and metadata files, so that researchers do not need a DBMS installed on their own desktop machine or laptop, which is often used without network connectivity.

It is also a simple task to generate an HTML version on the fly so that standard web browsers can navigate in such a linked domain. Using a relational database as the primary container would always require a special and complex shell to provide the appropriate services. This aspect is of great importance in particular for small groups or individuals who may want to integrate their data into the domain in an easy way. As in the domain of independent web sites, large index files are generated to support fast searches. Here, relational databases are of great help, since they are optimised for this kind of operation.
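To illustrate how little is needed to generate an HTML view from linked XML files, here is a minimal sketch using only the Python standard library. The corpus and session elements and their name and href attributes are hypothetical; they do not follow the actual IMDI schema.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical metadata node that links to two further metadata files.
METADATA = """
<corpus name="Example corpus">
  <session name="Session 1" href="session1.xml"/>
  <session name="Session 2 &amp; notes" href="session2.xml"/>
</corpus>
"""


def to_html(xml_text):
    """Render one metadata node as an HTML fragment with a list of links."""
    root = ET.fromstring(xml_text)
    items = [
        '<li><a href="{}">{}</a></li>'.format(
            html.escape(session.get("href", ""), quote=True),
            html.escape(session.get("name", "")),
        )
        for session in root.findall("session")
    ]
    return "<h1>{}</h1>\n<ul>\n{}\n</ul>".format(
        html.escape(root.get("name", "")), "\n".join(items)
    )
```

A call such as to_html(METADATA) returns a fragment that any browser can display. Each link leads to the next XML file in the domain, which can be rendered in the same way.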

For complex multi-modal annotations of multimedia recordings we have chosen repositories of XML-structured files as the persistent format. Their structure was derived from a rich UML-based object model. Neither data model, the hierarchical one underlying XML or the relational one, is perfectly suited to representing the needed complexity. However, XML files can be used for archiving, for direct exploitation by interested users and for easy cross-linking, which is becoming increasingly important in linguistics.

With a relational database representation we would not have an archival format, and every access or reference would require some operation. Again, for special applications such as searching large sets of annotations, a special container type could be useful, for example to achieve high performance.

In the talk we will use some concrete examples, as indicated above, to explain the advantages and disadvantages of the major data models for representing and accessing linguistic data. We will also explain why we believe that the current trend, in which an increasing number of linguists create important linguistic content with the help of relational database systems, is leading into a trap. The internal storage format is not directly accessible, only a few linguists are aware of the need to export the data early enough, and the DOBES and ECHO experiences show that, in general, data exported from relational databases is erroneous and needs much post-processing to be useful.

We therefore see a great risk that much useful data will be lost after a few years when, for example, the access software (version) is no longer available.
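One way to reduce this risk is to export the content regularly to an open, self-describing format. The following sketch dumps one table of an SQLite database to XML with the Python standard library. The function name export_table and the table, row and field element names are assumptions made for this illustration, not an established exchange format, and a real export would also have to preserve the relations between tables.

```python
import sqlite3
import xml.etree.ElementTree as ET


def export_table(conn, table):
    """Serialise all rows of one table as an XML string."""
    # The table name is interpolated into the SQL text, so accept only
    # plain identifiers.
    if not table.isidentifier():
        raise ValueError("unexpected table name: {!r}".format(table))
    cursor = conn.execute('SELECT * FROM "{}"'.format(table))
    columns = [description[0] for description in cursor.description]
    root = ET.Element("table", name=table)
    for row in cursor:
        record = ET.SubElement(root, "row")
        for column, value in zip(columns, row):
            if value is None:
                continue  # unfilled attributes are simply left out
            field = ET.SubElement(record, "field", name=column)
            field.text = str(value)
    return ET.tostring(root, encoding="unicode")
```

The resulting file can be read without the database software that produced it, which is the property an archival copy needs.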

CELEX = Centre for Lexical Information (http://www.mpi.nl/world/celex)
Speech Error Database = not yet openly available
IMDI = ISLE Metadata Initiative (http://www.mpi.nl/DOBES)
ELAN = Eudico Linguistic Annotator (http://www.mpi.nl/tools)
DOBES = Dokumentation Bedrohter Sprachen (http://www.mpi.nl/DOBES)
ECHO = European Cultural Heritage Online (http://www.mpi.nl/ECHO)