Jeff Good, University of Pittsburgh

The Descriptive Grammar as a (Meta)Database

Traditional descriptive grammars have many features associated with modern databases. They are highly structured documents consisting of discrete, titled sections, which, to a certain extent, are intended to stand on their own, each serving as the equivalent of a database "record". However, detailed formal modeling of the structure of descriptive grammars has yet to be undertaken. This contrasts with other database-like linguistic resources, such as interlinear text (see, e.g., Bow, Hughes, and Bird 2003) and dictionaries and lexicons (see, e.g., Sperberg-McQueen and Burnard 2002). Developing a formal model of the information encoded in descriptive grammars would be an important step towards the creation of best-practice standards for electronic grammars.

This paper proposes such a model, largely based on a survey of four print grammars: two "best-practice" grammars (Rice 1989, Haspelmath 1993), a grammar from a language family (Bantu) with its own specialized descriptive tradition (Maganga and Schadeberg 1992), and a "legacy" grammar (Williamson 1965), which is partially cast in early transformational grammar. The proposed model has the following characteristics:
  • The basic structure of a descriptive grammar is understood to be a set of annotations (generally in the form of discursive text) on a corpus of language data (mostly consisting of sentences and paradigms). A grammar is thus a kind of "metadatabase" consisting of generalizations over a database of linguistic forms.
  • The annotations are typically accompanied by exemplars: data samples chosen by the linguist as good examples of the phenomenon under discussion.
  • The annotations on the data, either explicitly or implicitly, make extensive use of three distinct types of ontologies:
    • General ontologies: Sets of terms assumed to be understood by the whole linguistics community (e.g., singular/plural, inflection/derivation)
    • Subcommunity ontologies: Sets of terms assumed to be understood by a subcommunity of linguists who work on a particular family or other well-defined group of languages (e.g., the Bantuist term extension, for a special class of verbal suffixes; the "Middle-Eastern" term masdar for a kind of verbal noun)
    • Local ontologies: Sets of terms specific to the particular language under examination (e.g., the reduced genitive in Lezgian, the distinction between the short and long applicative in Kinyamwezi)
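The characteristics listed above can be sketched as a small data model. The following Python sketch is illustrative only: the class and field names (Annotation, Exemplar, Term, OntologyScope, terms_by_scope) are assumptions introduced here, not part of the proposed model, and the annotation text and exemplar forms are placeholders rather than real language data.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class OntologyScope(Enum):
    """The three ontology types distinguished in the model."""
    GENERAL = "general"            # e.g. singular/plural
    SUBCOMMUNITY = "subcommunity"  # e.g. the Bantuist term "extension"
    LOCAL = "local"                # e.g. "short applicative" in Kinyamwezi


@dataclass(frozen=True)
class Term:
    label: str
    scope: OntologyScope


@dataclass
class Exemplar:
    form: str  # placeholder for a data sample chosen by the linguist


@dataclass
class Annotation:
    """One record: discursive text about the data, plus terms and exemplars."""
    title: str
    text: str
    terms: List[Term] = field(default_factory=list)
    exemplars: List[Exemplar] = field(default_factory=list)


def terms_by_scope(annotations, scope):
    """Return the sorted, de-duplicated term labels of one ontology type."""
    return sorted({t.label for a in annotations for t in a.terms if t.scope is scope})


# A toy two-record "grammar"; text and forms are placeholders (assumption).
grammar = [
    Annotation(
        title="Number",
        text="Placeholder discursive text.",
        terms=[Term("singular", OntologyScope.GENERAL),
               Term("plural", OntologyScope.GENERAL)],
    ),
    Annotation(
        title="Applicatives",
        text="Placeholder discursive text.",
        terms=[Term("extension", OntologyScope.SUBCOMMUNITY),
               Term("short applicative", OntologyScope.LOCAL),
               Term("long applicative", OntologyScope.LOCAL)],
        exemplars=[Exemplar("PLACEHOLDER-FORM")],
    ),
]
```

Calling terms_by_scope(grammar, OntologyScope.LOCAL) on this toy grammar collects the language-specific terms, which is the kind of query that an explicit separation of ontology types makes possible.
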
Aspects of this model have been straightforwardly represented in XML by treating the "annotation" as the primary content of the grammar and including methods for reference within the annotation to grammatical data in external resources (like a dictionary or texts) and for reference to terms drawn from ontologies. An adapted version of the XML format currently employed is given in (1a), along with a possible display format of the representation in (1b). In addition to being expressible in XML, this model can be readily encoded in a database format where each annotation is understood as the core of a database "record".
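To make the idea of an annotation-centered XML record more concrete, the following Python sketch builds one with the standard library. It is not the format referred to in (1a): the element and attribute names (annotation, title, text, term, dataref, ontology, resource, target), the file name texts.xml, and all content are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def build_annotation(ann_id, title, text, terms, data_refs):
    """Build one annotation record as an XML element.

    terms:     list of (label, ontology_type) pairs
    data_refs: list of (external_resource, target_id) pairs
    All element and attribute names are hypothetical.
    """
    ann = ET.Element("annotation", {"id": ann_id})
    ET.SubElement(ann, "title").text = title
    ET.SubElement(ann, "text").text = text
    for label, ontology in terms:
        # Reference to a term drawn from one of the three ontology types.
        ET.SubElement(ann, "term", {"ontology": ontology}).text = label
    for resource, target in data_refs:
        # Pointer to grammatical data held in an external resource.
        ET.SubElement(ann, "dataref", {"resource": resource, "target": target})
    return ann


example = build_annotation(
    "ann-001",
    "Applicatives",
    "Placeholder discursive text.",
    [("extension", "subcommunity"), ("short applicative", "local")],
    [("texts.xml", "s12")],  # hypothetical resource and sentence id
)
xml_string = ET.tostring(example, encoding="unicode")
```

Because each annotation element carries its own identifier, terms, and data pointers, the same record could equally be stored as one row (plus linked rows) in a relational database.
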

This proposed model is derived from the practice of print grammars. However, a range of extensions to it for electronic grammars can be envisioned, including encoding of directives for dynamic retrieval of all examples of some grammatical phenomenon found in a corpus (a simple version of which has been implemented) and encoding of machine-readable formal grammar rules accompanying the discursive text (presently under research).
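One way to picture a directive for dynamic retrieval is as a stored query that is resolved against a tagged corpus whenever the grammar is displayed. The Python sketch below illustrates that idea only; it is not the implementation mentioned above, and the names (CorpusItem, RetrievalDirective, resolve), the tag labels, and the corpus items are all assumptions with placeholder content.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class CorpusItem:
    item_id: str
    form: str              # placeholder for a sentence or paradigm
    tags: FrozenSet[str]   # grammatical phenomena attested in the item


@dataclass(frozen=True)
class RetrievalDirective:
    """Stored in an annotation in place of a fixed list of examples."""
    required_tags: FrozenSet[str]


def resolve(directive: RetrievalDirective, corpus: List[CorpusItem]) -> List[CorpusItem]:
    """Return every corpus item carrying all tags the directive requires."""
    return [item for item in corpus if directive.required_tags <= item.tags]


# Toy corpus with placeholder forms and hypothetical tags.
corpus = [
    CorpusItem("s1", "PLACEHOLDER-1", frozenset({"applicative", "past"})),
    CorpusItem("s2", "PLACEHOLDER-2", frozenset({"past"})),
    CorpusItem("s3", "PLACEHOLDER-3", frozenset({"applicative"})),
]

all_applicatives = resolve(RetrievalDirective(frozenset({"applicative"})), corpus)
matching_ids = [item.item_id for item in all_applicatives]
```

Unlike a hand-picked exemplar, the result of such a directive changes as the corpus grows, so the two mechanisms complement rather than replace each other.
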


Bow, Cathy, Baden Hughes, and Steven Bird. 2003. Towards a general model of interlinear text. Proceedings of EMELD Workshop 2003: Digitizing and Annotating Texts and Field Recordings. LSA Institute: Lansing MI, USA. July 11 - 13, 2003.

Haspelmath, Martin. 1993. A grammar of Lezgian. Berlin: Mouton.

Maganga, Clement and Thilo Schadeberg. 1992. Kinyamwezi: Grammar, texts, vocabulary. Köln: Rüdiger Köppe Verlag.

Rice, Keren. 1989. A grammar of Slave. Berlin: Mouton.

Sperberg-McQueen, C.M. and L. Burnard. 2002. TEI P4: Guidelines for Electronic Text Encoding and Interchange. Text Encoding Initiative Consortium. XML Version: Oxford, Providence, Charlottesville, Bergen.

Williamson, Kay. 1965. A grammar of the Kolokuma dialect of Ijo. Cambridge: Cambridge University Press.