Catherine Bow, Baden Hughes & Steven Bird, University of Melbourne

Towards a General Model for Linguistic Paradigms

If we take a database to be a structured collection of data arranged for efficient search and retrieval, then a wide range of structures prevalent in linguistic research and language description can be regarded as databases. One widely used linguistic data structure is the paradigm, a tabular representation common in linguistic description. Complex multi-dimensional information, such as phoneme charts and the inflectional forms of words, is frequently presented in this way. The goal of this paper is to describe an encoding model for linguistic paradigms, developed through consideration of a range of paradigms found in the literature.

In general, paradigms can be viewed as a two-dimensional arrangement of elements and attributes, with optional row and column labels. For example, in a consonant chart the convention is for the horizontal axis to denote the place of articulation and the vertical axis to denote the manner of articulation. Each axis can be segmented into elements according to a number of criteria, and each element can have a series of attributes. The elements along the horizontal axis typically take values such as bilabial, alveolar and velar, and those along the vertical axis take values such as stop, fricative and lateral. The dimension of voicing may be represented in various ways, for instance by placing a pair of values for the voiced and voiceless forms within each cell, or by subdividing the vertical axis within each manner of articulation. Cell contents may be phonetic characters or phonemic representations, and a cell may be left empty where there is a gap in the linguistic inventory or in the data. Another common example is the morphosyntactic paradigm, where a tabular representation of verb conjugation may involve a combination of persons, numbers, genders, tenses, aspects, moods, cases, etc. Whilst convention still holds in some instances, here there is greater flexibility in the selection of the elements presented on the vertical and horizontal axes, in the ordering of these elements, and in the attributes given for each element.
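As a minimal sketch of this view, a consonant chart can be held as a mapping from attribute values to cell contents, with voicing treated as a third dimension rather than a layout device. The inventory, the names `consonants`, `lookup` and `axis_values`, and the choice of a Python dictionary are all illustrative assumptions, not part of the model proposed in this paper.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical inventory: (place, manner, voicing) -> symbol.
# A key that is absent represents an empty cell (a gap in the inventory).
Cell = Tuple[str, str, str]

consonants: Dict[Cell, str] = {
    ("bilabial", "stop", "voiceless"): "p",
    ("bilabial", "stop", "voiced"): "b",
    ("alveolar", "stop", "voiceless"): "t",
    ("alveolar", "stop", "voiced"): "d",
    ("velar", "stop", "voiceless"): "k",
    ("velar", "stop", "voiced"): "g",
    ("alveolar", "fricative", "voiceless"): "s",
    ("alveolar", "fricative", "voiced"): "z",
    ("alveolar", "lateral", "voiced"): "l",
}


def lookup(place: str, manner: str, voicing: str) -> Optional[str]:
    """Return the content of one cell, or None if the cell is empty."""
    return consonants.get((place, manner, voicing))


def axis_values(index: int) -> List[str]:
    """Distinct values along one dimension, in first-seen order."""
    seen: List[str] = []
    for key in consonants:
        if key[index] not in seen:
            seen.append(key[index])
    return seen


print(lookup("velar", "stop", "voiced"))
print(lookup("bilabial", "fricative", "voiceless"))  # empty cell
print(axis_values(0))  # elements of the place axis
```

Keeping voicing as a dimension of the key, rather than as a pair inside each cell, leaves the decision about how to display it to whatever produces the printed table.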

A significant advantage of the linguistic paradigm is its ability to present complex data in tabular form, so that several dimensions of information can be shown in a two-dimensional table. Paradigms displayed on the page incorporate a variety of devices to represent more than two dimensions. The range of presentations possible for the same data set indicates that the underlying structure of the paradigm can be rendered into a variety of outputs. At the same time, the two-dimensionality of the printed page obscures the complexity of the underlying model. The challenge is to express dynamic multi-dimensional paradigms clearly in the static two-dimensional format of the printed page.
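The point that one data set admits several presentations can be illustrated with a small sketch. It assumes a three-dimensional paradigm stored as a dictionary and a hypothetical `render` function that places two chosen dimensions on the axes and folds the remaining one into each cell, separated by a slash. The Latin forms of amare are illustrative data and are not drawn from the paradigms surveyed here.

```python
from collections import defaultdict

# Hypothetical paradigm: (person, number, tense) -> inflected form.
DIMENSIONS = ("person", "number", "tense")
forms = {
    ("1", "sg", "present"): "amo",
    ("2", "sg", "present"): "amas",
    ("3", "sg", "present"): "amat",
    ("1", "pl", "present"): "amamus",
    ("2", "pl", "present"): "amatis",
    ("3", "pl", "present"): "amant",
    ("1", "sg", "imperfect"): "amabam",
    ("2", "sg", "imperfect"): "amabas",
    ("3", "sg", "imperfect"): "amabat",
    ("1", "pl", "imperfect"): "amabamus",
    ("2", "pl", "imperfect"): "amabatis",
    ("3", "pl", "imperfect"): "amabant",
}


def render(cells, row_dim, col_dim):
    """Project the paradigm onto two axes; the dimension not placed on an
    axis is folded into each cell as slash-separated forms."""
    r, c = DIMENSIONS.index(row_dim), DIMENSIONS.index(col_dim)
    grid = defaultdict(list)
    rows, cols = [], []
    for key, form in cells.items():
        if key[r] not in rows:
            rows.append(key[r])
        if key[c] not in cols:
            cols.append(key[c])
        grid[(key[r], key[c])].append(form)
    table = [[""] + cols]  # header row of column labels
    for row in rows:
        table.append(
            [row] + ["/".join(grid.get((row, col), [])) for col in cols]
        )
    return table


# Two presentations of the same underlying data.
for line in render(forms, "person", "number"):
    print("\t".join(line))
for line in render(forms, "tense", "person"):
    print("\t".join(line))
```

Folding the extra dimension into the cell is only one of the devices a printed paradigm might use; subdividing an axis or repeating the table per value would be equally valid renderings of the same structure.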

We propose a paradigm encoding model expressed in XML, utilising XML's architectural support for namespaces and schemas to describe the relation between the linguistic data appearing in a single paradigm and other relevant data (lexicons, morphosyntactic descriptions, etc.). Linguistic paradigms encoded in this manner can thus be queried in a range of ways: at the cell and axis levels, and also in relation to other linguistic artifacts such as lexicons and interlinear texts. We propose to treat the presentation of a linguistic paradigm encoded in this fashion as a rendering problem. Such an approach allows us to gain two efficiencies: an abstract structural representation and a high degree of flexibility in presentation.
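To make cell-level and axis-level querying concrete, the following sketch encodes a small consonant chart in XML and queries it with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`paradigm`, `dimension`, `value`, `cell`) and the helper functions are hypothetical and are not the schema proposed in this paper; namespaces, schema validation and links to lexicons are omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical encoding: dimensions are declared once, and each cell
# carries one attribute per dimension.
PARADIGM_XML = """
<paradigm id="consonants">
  <dimension name="place">
    <value>bilabial</value><value>alveolar</value><value>velar</value>
  </dimension>
  <dimension name="manner">
    <value>stop</value><value>fricative</value>
  </dimension>
  <dimension name="voicing">
    <value>voiceless</value><value>voiced</value>
  </dimension>
  <cell place="bilabial" manner="stop" voicing="voiceless">p</cell>
  <cell place="bilabial" manner="stop" voicing="voiced">b</cell>
  <cell place="alveolar" manner="stop" voicing="voiceless">t</cell>
  <cell place="alveolar" manner="stop" voicing="voiced">d</cell>
  <cell place="velar" manner="stop" voicing="voiceless">k</cell>
  <cell place="velar" manner="stop" voicing="voiced">g</cell>
  <cell place="alveolar" manner="fricative" voicing="voiceless">s</cell>
  <cell place="alveolar" manner="fricative" voicing="voiced">z</cell>
</paradigm>
"""

root = ET.fromstring(PARADIGM_XML)


def _predicate(attrs):
    """Build chained attribute predicates, e.g. [@place='velar']."""
    return "".join(f"[@{name}='{value}']" for name, value in attrs.items())


def cell(paradigm, **attrs):
    """Cell-level query: text of the first matching cell, or None."""
    found = paradigm.find("cell" + _predicate(attrs))
    return found.text if found is not None else None


def along_axis(paradigm, **fixed):
    """Axis-level query: all cells sharing the fixed attribute values."""
    return [c.text for c in paradigm.findall("cell" + _predicate(fixed))]


def axis(paradigm, name):
    """Declared values of one dimension, in document order."""
    return [v.text for v in paradigm.findall(f"dimension[@name='{name}']/value")]


print(cell(root, place="velar", manner="stop", voicing="voiced"))
print(along_axis(root, manner="stop", voicing="voiceless"))
print(axis(root, "place"))
```

Because the encoding records only dimensions and cells, nothing in it commits to a particular page layout; which dimension becomes a row or a column is left to the rendering step.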