Proposed Grammatical Features Resource

Anna Kibort & Greville G. Corbett (Surrey Morphology Group, University of Surrey, UK)


* The research reported here is supported by ESRC grant number RES-051-27-0122. This support is gratefully acknowledged.

1. Background and relation to GOLD

The Grammatical Features Resource project is complementary to GOLD, and the motivation for offering the paper is to aim for useful interaction. The Features project, begun at the University of Surrey in November 2004, aims to deepen the knowledge of the linguistic concept ‘feature’ by bringing together typological research on the content of features with formal work on their behaviour. One of the objectives of the project is to produce an on-line Inventory of Features, listing features proposed, with sources pointing to the decisive evidence. The typological approach is meant to help ensure that the feature theory proposed can meet the range of diversity found in natural language. The Inventory, in turn, will be invaluable as a steppingstone for asking: What can be a feature? What features occur across different components? How do features interact? What potential features, as inferred from the patterns of the occurring features, appear to be missing from the feature inventory? The careful catalogue of the various types and uses of features will provide the basis for a theoretical conceptualisation of the notion ‘feature’. It will help demonstrate the type of features on which linguistic theory can legitimately call and the implications of adopting different theoretical perspectives on features while using them for the same descriptive goals.

The compilation of the Feature Inventory is one clear area of overlap of the Features project with the objectives of the GOLD project. The practical approach taken in GOLD has been to obtain a list of possible feature values from a selected source, with possible supplementation from other sources (Farrar, Lewis & Langendoen 2002). This sidesteps two major issues in constructing a feature/value inventory, namely the analysis problem and the correspondence problem. The proposed features site is intended to provide resources for addressing these substantial issues.

2. A feature inventory: the analysis problem

Analysis requires us to state how we show that a particular language has, or has not, a given feature, and how many values the feature has. Sometimes choosing a feature and value in the description of a language is straightforward. In those cases where it is not straightforward, the proposed Features Resource will point to the way it can be done, with sample analyses of some difficult cases.

2.1. Establishing whether a language has a feature

This section gives an example where deciding whether to use a particular feature in the description of a language requires careful consideration. We tend to assume that languages have a person feature, but with Archi (Nakh–Daghestanian) this is not self–evident. Archi (like some related languages) has no unique forms for agreement in person, and the standard description of this language (Kibrik et al. 1977, Kibrik 1977a, b) does not involve the feature ‘person’. However, the agreement patterns in Archi may be interpreted in favour of the presence of this feature, despite the absence of any phonologically distinct forms realising it.

Archi distinguishes four genders and two numbers. Table (1) lists affixal agreement forms marking verbs in Archi, with examples in (2).

The personal pronouns zon ‘I’ and un ‘you (singular)’ take gender agreements corresponding to the gender of the speaker or addressee: male humans trigger gender I agreement, female humans — gender II agreement, and imaginary locutors of genders III and IV (e.g. a speaking cow and a speaking goat kid) trigger gender III and IV agreements, respectively.

If Archi has no person feature, we should expect the same pattern of agreement, based on gender, to occur with personal pronouns in the plural. Indeed, this is what happens with the personal pronoun teb ‘they’. It takes gender I/II agreement (the prefix b–) when the referents are human, and gender III/IV agreement (zero marking) when the referents are non–human:

However, unexpectedly, the personal pronouns nen ‘we’ and žʷen ‘you (plural)’ referring to humans do not take the gender–based I/II agreement marker (b–). Instead, they trigger zero marking, which we gloss as III/IV.PL as in Table (1):

One possibility to account for this unexpected agreement is to say that the two pronouns nen ‘we’ and žʷen ‘you (plural)’ are unusual lexical items. This approach was adopted in the standard grammar of Archi (Kibrik et al. 1977); apart from the four clear–cut genders (I–IV), four more genders (V–VIII) were proposed to account for different patterns of agreement triggered by some pronouns and a very small number of nouns involving human referents:

Thus an analysis without a person feature requires a complication of the gender system.

The next thing to consider is the resolution rules, which specify the form of an agreeing element when the controller consists of conjoined noun phrases. For most combinations of conjuncts, the significant factor seems to be the presence of at least one conjunct denoting a rational. Thus, conjuncts headed by nouns in genders I and II trigger the I/II plural form of the verb as in (6a–c), and when there are no conjuncts denoting rationals, III/IV plural agreement is found as in (6d–e) (Kibrik 1977b:186–187, Corbett 1991:271–273):

The noun xalq’ ‘people’ takes gender III agreement when singular and gender I/II agreement when plural (in (7) below we gloss it as III, though it has been separated out into its own ‘gender VII’ by Kibrik). The behaviour (in coordinate constructions) of this noun and a small set of other nouns involving human referents can similarly be attributed to the fact that these nouns denote rationals. The presence of xalq’ in coordinate constructions, then, triggers gender I/II agreement as expected:

However, when one of the conjuncts is a pronoun (from Kibrik’s gender V or VI), the hypothesised resolution rule based on semantics does not produce the expected result. Instead of the predicted gender I/II marking (b–), the following conjoined phrases trigger zero marking, the form equivalent to gender III/IV plural agreement, as in example (4) (but see (9) below, where we give a revised analysis):

Kibrik’s solution to the agreement pattern in coordinate constructions in Archi has been to group the proposed eight genders into ranks, with rank 1 comprising genders V and VI; rank 2 — genders I, II, VII and VIII; and rank 3 — genders III and IV. He then suggested a resolution rule, based on the system of eight genders and their ranks, according to which the target verb and auxiliary will agree with the gender of the conjunct belonging to the numerically lowest rank (rank 1 < rank 2 < rank 3). The rule accounts for all the examples discussed above, but it is typologically an odd resolution system. First, it is ‘two–level’, with genders and ranks of genders. Second, the reference to genders V and VI is essentially an indirect way of referring to personal pronouns, making them a kind of exceptional category within the gender system.
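
Kibrik’s rank-based rule, as summarised above, can be sketched in Python. This is only an illustration and not Kibrik’s own formalisation: the names `RANK` and `controlling_gender` are hypothetical, each conjunct is represented solely by its label in the eight-gender system, and a tie within a rank is resolved (arbitrarily) in favour of the first such conjunct.

```python
# Ranks of the eight proposed genders, as described in the text.
RANK = {
    "V": 1, "VI": 1,                       # rank 1: the pronoun 'genders'
    "I": 2, "II": 2, "VII": 2, "VIII": 2,  # rank 2
    "III": 3, "IV": 3,                     # rank 3
}


def controlling_gender(conjunct_genders):
    """Return the gender of the conjunct in the numerically lowest rank."""
    if not conjunct_genders:
        raise ValueError("at least one conjunct is required")
    # min() keeps the first conjunct among those sharing the lowest rank.
    return min(conjunct_genders, key=RANK.__getitem__)
```

Under this sketch a conjunct of gender V or VI always determines agreement, which makes visible how the rule refers to personal pronouns indirectly, through the gender system.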

We should indeed base our gender resolution rules for Archi not on gender classes, but on a general rule formulated in purely semantic terms (i.e. if there is at least one conjunct denoting a rational or rationals, gender I/II agreement will be used; otherwise gender III/IV will be used). This much is compatible with what we know about systems of resolution rules, but it does not account for the agreement with conjoined pronouns.

Therefore, rather than treating the personal pronouns as each being an exception in terms of gender, we prefer to accept a person feature. Specifically, the gender I/II marking (b–) in examples (4) and (8) would have been expected if we took into consideration only gender and number, following the pattern from the singular. However, if we recognise that there may be person agreement in Archi despite the lack of unique phonological realisation, both the behaviour of the two plural pronouns (‘we’ and ‘you’) and the resolution rules in general become simpler and cease to be typologically odd. Formulated in this way, the gender resolution rules are fairly usual, and the person resolution rules required (persons 1 and 2 > person 3) are standard, except for the interesting point that there is no distinction here between persons 1 and 2.
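
The two resolution rules just described can be stated separately, as in the following Python sketch. It is a minimal illustration under stated assumptions: the `Conjunct` class and the function names are hypothetical, a conjunct is reduced to its person and to whether it denotes a rational, and the returned labels (`"1/2"`, `"3"`, `"I/II"`, `"III/IV"`) simply mirror the glosses used in this paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Conjunct:
    person: int     # 1, 2 or 3
    rational: bool  # True if the conjunct denotes a rational or rationals


def resolve_person(conjuncts):
    """Persons 1 and 2 outrank person 3; 1 and 2 are not distinguished."""
    return "1/2" if any(c.person in (1, 2) for c in conjuncts) else "3"


def resolve_gender(conjuncts):
    """I/II if at least one conjunct denotes a rational, otherwise III/IV."""
    return "I/II" if any(c.rational for c in conjuncts) else "III/IV"
```

Neither function mentions gender classes V–VIII or ranks: the exceptional behaviour of the pronouns is handled by person resolution alone.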

Thus, we analyse Archi as having agreement in person, with the following paradigm:

There is no unique form for person agreement in Archi (since the Ø– marker for 1 and 2 person plural also serves for genders III and IV in the 3 person), and the feature is limited both with regard to its domain (only the plural) and values (3 versus 1/2, but not 1 versus 2). However, to account for the use of agreeing forms and their distribution in the paradigm, we use the person feature. We need to know the person of the subject in Archi in order to select the appropriate agreement marker for the verb. Thus, the obvious ploy of requiring a phonological realisation is shown to be inadequate for instances like person in Archi, where no single form demonstrates the existence of the feature and yet there are good arguments for person in this language. Whatever one’s final choice, person is an interesting point of analysis of Archi.
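
The point that the person of the subject is needed to select the marker can be made concrete with a small Python sketch of the plural forms discussed in this section. It covers only what the prose states (b– for third person plural I/II, zero elsewhere in the plural); the function name is hypothetical, the empty string stands for zero marking, and the singular forms are omitted because they are not given in the running text.

```python
def plural_agreement_prefix(person, rational):
    """Select the plural agreement prefix; "" stands for zero marking."""
    if person in (1, 2):
        # 'we' and 'you (plural)': zero marking even with human referents.
        return ""
    if person == 3:
        # 'they': b- for gender I/II (humans), zero for gender III/IV.
        return "b-" if rational else ""
    raise ValueError("person must be 1, 2 or 3")
```

The zero prefix is returned on two different paths, which is exactly why no single form demonstrates the feature, while removing the `person` argument would make the function give the wrong result for ‘we’ and ‘you (plural)’.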

2.2. Establishing the values of a feature

If the existence of a feature is demonstrable, we must then show how many values it has. Again, requiring a phonological realisation is too simple. There are instances in the literature of careful argumentation for difficult instances, notably the debate as to the number of case values in Russian (Zaliznjak 1973 and Comrie 1986).

In most traditional and pedagogical literature¹, Russian is described as having six cases (nominative, accusative, genitive, dative, instrumental, and prepositional/ablative), but there are very good reasons for distinguishing three more cases in this language. First, contemporary Russian is developing a separate vocative case (distinct from nominative). Second, some nouns such as syr ‘cheese’ have a distinct form for partitive (genitive) syru in addition to (nonpartitive) genitive syra. And finally, apart from the prepositional case, there is a distinct locative case in Russian, as in prepositional [o] sade (‘[about] the orchard’) and locative [v] sadu (‘[in] the orchard’) from sad (‘orchard’). According to Comrie, the fact that the last two case distinctions in Russian are an innovation may explain why most contemporary sources are reluctant to recognise them: they do not fit in with the traditional assumptions as to what cases a Slavonic language may have. However, they have to be considered in any adequate synchronic description of Russian.

Latin and Latvian provide even more challenging examples of a discrepancy between the standard assumption regarding their case systems and the reality of the morphological phenomenon of case. Apart from pointing out the inadequacy of standard descriptions, Comrie (1986) uses these two examples as further support for his claim that the traditional characterisation of case, inadequately combining formal and functional criteria, leads to immense complications in the description of case.

In short, it is standard to describe Latin as having six cases (nominative, vocative, accusative, genitive, dative, and ablative) and then, while discussing the declension of individual morphological classes, introduce an additional form (locative case) for a small subset of nouns (names of towns and small islands, and a very restricted number of individually specifiable lexical items) in most of these classes. For other nouns, the function performed by the locative case is expressed by using the preposition in ‘in’ which takes the ablative case. Apart from the fact that, by the traditionally adopted distributional criterion, the locative has to be defined as a separate case, we have here ‘distributional grounds for identifying a prepositional phrase (with one set of nouns) with a case (for another set of nouns)’ (Comrie 1986:94).

The case system of standard Latvian, on the other hand, is an example of a system which is impossible to analyse adequately with traditional formal tools (Fennell 1975, Comrie 1986). Traditional accounts of Latvian list a separate instrumental case which, however, only occurs with the preposition ar. In the singular, the allegedly ‘instrumental’ form of the nominal that appears after ar is identical to the accusative, and in the plural it is identical to the dative. If a separate instrumental case was established on this basis, it would be distinguishable formally from both the accusative and the dative. However, according to traditional accounts, all Latvian prepositions take the dative in the plural regardless of the case (accusative, genitive, or dative) that they govern in the singular.

The best solution to this inconsistency in traditional description (though still adhering to the traditional rules of description) would seem to be to say that Latvian has no instrumental, and that the preposition ar governs the accusative case (with the provision that, like all Latvian prepositions, in the plural it governs the dative). However, by the distributional criterion, this suggestion (as well as the original traditional account) creates a contradiction: a given preposition may not govern one case in the singular and a different case in the plural, because in this way the very distributions of one and the same case would be different in the singular and the plural.

The distributional criterion forces us to say that the cases occurring after prepositions in Latvian (other than after prepositions that take the dative in the singular) can never be identified with singular cases occurring other than after prepositions, although ‘accusative2’ and ‘genitive2’ (which would be the cases occurring after prepositions) in the plural are homonymous with the dative. As is clear, this solution is redundant and misses the obvious generalisation, which should be captured in a grammar of Latvian, that all prepositions in Latvian require the same form of a plural nominal.
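
The generalisation that the distributional analysis misses is short enough to state as a single rule. The Python sketch below is a minimal illustration under the traditional description summarised above: the function name and the string labels are hypothetical, and the preposition is represented only by the case it governs in the singular.

```python
SINGULAR_GOVERNMENT = {"accusative", "genitive", "dative"}


def form_after_preposition(singular_government, number):
    """Return the formal case of a nominal after a Latvian preposition.

    singular_government is the case the preposition governs in the singular.
    """
    if singular_government not in SINGULAR_GOVERNMENT:
        raise ValueError("unknown case: " + singular_government)
    if number == "plural":
        return "dative"  # the same form for every preposition
    if number == "singular":
        return singular_government
    raise ValueError("number must be 'singular' or 'plural'")
```

With the preposition ar treated as governing the accusative in the singular, the rule yields the accusative form in the singular and the dative form in the plural, with no separate instrumental and no ‘accusative2’ or ‘genitive2’.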

Comrie’s solution to the description of a complex case system like this is an approach to the notion of case which attempts to synthesise the formal and functional aspects of case and, in particular, scrutinises the relation that holds between these two sides. His approach relies on the notion of feature analysis of case: both distributional and formal cases can be characterised in terms of the same features (e.g. the feature [genitive]), but a formal case (e.g. syra ‘cheese.NONPARTITIVE’, syru ‘cheese.PARTITIVE’, muki ‘flour.GENITIVE’) may correspond to a subset of the features of a distributional case ([genitive, nonpartitive], [genitive, partitive], or [genitive], respectively), thus giving rise to many–to–one mappings between distributional and formal cases. This enables an adequate analysis of case ‘syncretism’ and at the same time accounts for generalisations within a case system (for details see Comrie 1986).
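
A rough Python sketch may help to show how a subset relation between feature sets produces many–to–one mappings. It is an illustration of the idea only, not Comrie’s formalism: the `FORMS` table, the English lexeme keys and the function `realise` are hypothetical, and only the three forms cited above are included.

```python
# Features of each formal case, keyed by lexeme (English gloss) and form.
FORMS = {
    "cheese": {
        "syra": frozenset({"genitive", "nonpartitive"}),
        "syru": frozenset({"genitive", "partitive"}),
    },
    "flour": {
        "muki": frozenset({"genitive"}),
    },
}


def realise(lexeme, distributional_case):
    """Return the forms whose features are a subset of the distributional case."""
    wanted = frozenset(distributional_case)
    return sorted(
        form for form, features in FORMS[lexeme].items() if features <= wanted
    )
```

For ‘cheese’ the two distributional cases select different forms, whereas for ‘flour’ both `{"genitive", "partitive"}` and `{"genitive", "nonpartitive"}` select muki, since its single feature is a subset of either set.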

The final example illustrates an analysis problem pertaining to a different morphological category: that of ‘number’. Establishing the possible values of number proved a long and difficult undertaking and its results, which are summarised below, can be found in Corbett 2000. A considerable variety of values are available in those languages where number is specified, with the most complex systems having five values; e.g. singular and plural (English); singular, dual, and plural (Upper Sorbian); singular, dual, trial, and plural (Larike, Central Maluku, Indonesia); singular, paucal, and plural (Bayso, East Cushitic, Ethiopia); singular, dual, paucal, and plural (Yimas, Lower Sepik, Papua New Guinea); singular, dual, trial (or paucal), paucal (or greater paucal), and plural (Lihir, Oceanic, Papua New Guinea).

The value of number that required special consideration was that of the ‘quadral’, a set of forms specifically for the quantity of four (Corbett 2000:26–30). The quadral has been used in the description of at least three languages from the Austronesian family. A well–documented suggested case is Sursurunga (Hutchisson 1986, and personal communications). The forms labelled quadral are restricted to the personal pronouns, but are found with all of them: the first person (inclusive and exclusive), the second and the third. However, besides being used of four, the quadral has two other uses which account for most of its instances. First, plural pronouns are never used with terms for dyads (kinship pairs like uncle–nephew/niece) and the quadral is then used instead for a minimum of four, and not just for exactly four. The second additional use is in hortatory discourse; the speaker may use the first person inclusive quadral, suggesting joint action including the speaker, even though more than four persons are involved. Thus, if the values of number are based on meaning, the quadral forms might be better designated ‘paucal’ rather than ‘quadral’.

Similarly, the suggested ‘trial’ in Sursurunga is also used for small groups, typically around three or four, and for nuclear families of any size. It is therefore not strictly a trial — rather, it could be glossed as ‘a few’ and also qualify as a paucal. The traditional quadral, then, which is frequently used with larger groups of four or more, and could be glossed as ‘several’, is in fact a greater paucal, and the traditional trial is a (normal/lesser) paucal. The plural, as we would expect, is for numbers of entities larger than are covered by the quadral (though there is no strict dividing line at any particular number), and the number system of Sursurunga can be represented as: singular, dual, paucal, greater paucal, and plural.

The other two languages for which the quadral has been claimed can be analysed along the same lines. Tangga, related closely to Sursurunga, (Capell 1971:260–262; Beaumont 1976:390²; confirmed by Malcolm Ross, personal communication) also has five number forms, and it seems clear that the forms which have the numeral ‘four’ as their source are not quadrals but rather paucals (Malcolm Ross, personal communication citing Maurer 1966; this is also Schmidt’s view given in Capell 1971:261). Unfortunately, we have no information on whether Tangga has a genuine trial or whether it has two paucals. Marshallese, more distantly related to Sursurunga, with five number forms for the first, second and third person pronouns (Bender 1969:8–9), also has an additional use of the quadral form: it is often used rhetorically with groups of more than four to give an illusion of intimacy (Bender 1969:159). Again, then, this may not be strictly a quadral.

Finally, there are several false trails in the literature regarding quadrals — that is, suggestions of other Austronesian languages with quadrals, which turn out in fact to have four number values not five. In such cases, the plural may have a form in which the numeral four can be reconstructed. Thus, we have found no clear case of a quadral, by which we mean a grammatical form for referring to four distinct real world entities in the way that trials refer to three.

2.3. Conclusions for a feature inventory

The proposed Grammatical Features Resource will offer two types of relevant information in addition to the listing of the features: first, the arguments which have been used to justify postulating features and values (with reference to the sources), and second, instances of challenging systems (which are typically those with the most and sometimes those with the fewest values). The sections above illustrated the type of information that will be available on the website to inform users of the available choices of feature labels and the corollaries of those choices. It will contain examples of full paradigms, more detailed commentaries than those outlined in the present paper, and, where possible, links to the sources of information and/or data.

3. A feature inventory: the correspondence problem

Consider now the correspondence problem, starting with the simplest instance. French and Slovene both have masculine and feminine genders. Do they correspond? Yes, in the sense that nouns denoting females are typically assigned to the feminine gender. No, if we consider that French has two genders and Slovene three. In an inventory of features, labels such as ‘FEM’ or ‘MASC’ are bound to have different meanings depending on the system of which they are a part (e.g. ‘FEM’ in a two–gender system versus ‘FEM’ in a three–gender system). Furthermore, the semantics of the genders can differ dramatically, as say between Tamil (semantically assigned) and Slovene (semantically and formally assigned), though both have three genders. Therefore, when describing the feature inventory, it is not enough to list gender values, but we need some declaration of the gender system to which they belong.
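
One way to picture such a declaration is to make every gender value carry a reference to its system. The Python sketch below is purely illustrative and is not a proposal for the format of the Features Resource: the class and field names are hypothetical, the value labels are conventional abbreviations, and the `assignment` field is filled in only where the text above states how gender is assigned.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class GenderSystem:
    language: str
    values: Tuple[str, ...]
    assignment: Optional[str] = None  # None: not stated in the text


@dataclass(frozen=True)
class GenderValue:
    label: str
    system: GenderSystem

    def __post_init__(self):
        # A label is meaningful only relative to a declared system.
        if self.label not in self.system.values:
            raise ValueError(self.label + " is not a value of this system")


FRENCH = GenderSystem("French", ("MASC", "FEM"))
SLOVENE = GenderSystem("Slovene", ("MASC", "FEM", "NEUT"), "semantic and formal")
```

Under this representation `GenderValue("FEM", FRENCH)` and `GenderValue("FEM", SLOVENE)` share a label but do not compare equal, which is the distinction a bare list of gender values cannot express.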

To complicate matters even further, there are many challenging cases where the correspondence problem might be thought of as occurring language–internally. Romanian has two genders like French if we look at agreement targets, but three like Slovene if we consider the nouns (i.e. agreement controllers). When describing the feature inventory, we need to have a way of stating the correspondences within the gender system.

In order to ensure that the starting point for the description of gender is the same for all languages, and to enable us to make meaningful comparisons between languages with relatively transparent gender systems and those with more complex gender systems, we need to distinguish ‘controller genders’ and ‘target genders’. The following examples are taken from Corbett 1991.

In several of the more familiar languages, the gender pattern is straightforward and the way in which the system is analysed is taken as self–evident. French, for example, is taken to have two values of gender, and this works well, because not only nouns are ‘masculine’ or ‘feminine’, but agreeing verb forms and adjectives are, too. Thus, controller genders and target genders correspond in a straightforward way and the gender values are labelled ‘masculine’ and ‘feminine’:

However, difficulties arise in languages with more complex gender systems. If the distinction between controller and target genders is not considered, gender values may be presented as though the pattern was equally uncontroversial, but no indication is given about what the values really mean. Then, the number of genders in a particular language can be the subject of interminable dispute, or we find that similar situations are described differently by those working on different language families.

A good example of a language whose gender system has been the source of continuing disagreement is Romanian (for references to the extensive literature on this topic see Corbett 1991:150). The argument, which has gone on for decades, is whether we have two genders or three. In terms of agreement classes, the situation is clear: there are three classes that should be set up as follows:

However, simply to say that Romanian has three genders suggests that it is like German, Latin or Tamil, even though in each of these languages, intuitively, the situation is rather different. It can be seen that Romanian has three controller genders (i.e. the genders into which nouns are divided), and it has two target genders (i.e. the genders which are marked on adjectives, verbs, and so on, depending on the language) in both singular and plural. The gender system of Romanian can, then, be diagrammed as follows (illustrated with the agreement forms of the adjective bun– ‘good’):

The controller genders in Romanian (i.e. the lines labelled ‘class I nouns’, ‘class II nouns’, and ‘class III nouns’) are usually called ‘masculine’ (I), ‘feminine’ (II), and the disputed gender (III) is sometimes called ‘neuter’ and sometimes ‘ambigeneric’; the latter is a useful term, provided it is used not to imply that there is no distinct gender but rather that the situation is different from the more common Indo–European three–gender system.
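
The relation between the three controller genders and the two target genders can also be written out as a small lookup, sketched here in Python. The structure and names are hypothetical, and the specific adjective forms (bun, bună, buni, bune) and the pairing of class III with the masculine form in the singular and the feminine form in the plural are supplied here as assumptions based on the standard description of Romanian rather than quoted from the running text.

```python
# Two target genders per number, shown on the adjective bun- 'good'.
TARGET_FORMS = {
    ("singular", "masculine"): "bun",
    ("singular", "feminine"): "bună",
    ("plural", "masculine"): "buni",
    ("plural", "feminine"): "bune",
}

# Three controller genders (noun classes) and the target gender each selects.
CONTROLLER_TO_TARGET = {
    "I": {"singular": "masculine", "plural": "masculine"},
    "II": {"singular": "feminine", "plural": "feminine"},
    "III": {"singular": "masculine", "plural": "feminine"},
}


def agreeing_form(noun_class, number):
    """Return the form of bun- agreeing with a noun of the given class."""
    return TARGET_FORMS[(number, CONTROLLER_TO_TARGET[noun_class][number])]
```

The table has three rows but only two distinct target genders in each number, so class III is distinguished only by its combination of singular and plural agreement.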

While there are many languages where the numbers of controller and target genders are the same, mismatches of the type that occurs in Romanian are common. Examples of several even more complex systems are given in Corbett 1991. The important point here is that the mismatches do not concern one or two odd exceptions within the category of nouns or agreeing forms, but they concern substantial parts of the lexicon of the language: the diagram above represents systematic correspondences occurring across the whole lexicon in Romanian. There are a couple of ways in which this information could be represented in a feature inventory, and these will be given as options to the users of the proposed Features Resource, with justification for the suggested feature labels.

3.1. Conclusions for a feature inventory

The solution to the correspondence problem, occurring both language–internally and crosslinguistically, requires careful analysis of the system behind each feature, whether gender, case, or any other. We have not elaborated here on the question of cross–language comparability of feature values, and it is certainly possible that some individual feature values in some individual languages (for example cases) are highly language–specific. But a feature inventory should be derived from theoretical considerations that should be able to capture the fact that there are also substantial crosslinguistic similarities.

The contribution here of the Features Resource will be to provide outline typologies, so that labels such as ‘FEM’ or ‘NEUT’ can be referred to a typology, clarifying the type of system within which they are functioning. The underlying philosophy, which is in tune with work on GOLD, is to offer useful tools for analysis and annotation.

1. For a current standard source listing declensional cases in Russian see, for example,
2. Note that Capell and Beaumont used the term ‘quadruple’.

4. References

Beaumont, Clive H. 1976. Austronesian languages: New Ireland. In: Stephen A. Wurm (ed.) Austronesian Languages: New Guinea Area Languages and Language Study II (Pacific Linguistics, series C, no. 39), 387–397. Canberra: Department of Linguistics, Research School of Pacific Studies, Australian National University.
Bender, Byron W. 1969. Spoken Marshallese: an Intensive Language Course with Grammatical Notes and Glossary. Honolulu: University Press of Hawaii.
Capell, Arthur. 1971. The Austronesian languages of Australian New Guinea. In: Thomas A. Sebeok (ed.) Current Trends in Linguistics VIII: Linguistics in Oceania, 240–340. The Hague: Mouton.
Comrie, Bernard. 1986. On delimiting cases. In: Richard D. Brecht & James S. Levine (eds) Case in Slavic, 86–106. Columbus, OH: Slavica.
Corbett, Greville G. 1991. Gender. Cambridge: CUP.
Corbett, Greville G. 2000. Number. Cambridge: CUP.
Farrar, Scott, William D. Lewis & D. Terence Langendoen. 2002. A common ontology for linguistic concepts. Proceedings of the Knowledge Technology Conference, Seattle, WA, March 2002.
Fennell, Trevor G. 1975. Is there an instrumental case in Latvian? Journal of Baltic Studies 6:41–48.
Kibrik, A. E., S. V. Kodzasov, I. P. Olovjannikova & D. S. Samedov. 1977. Opyt strukturnogo opisanija arčinskogo jazyka, I: Leksika, fonetika. (Publikacii otdelenija strukturnoj i prikladnoj lingvistiki, 11). Moscow: Izdatel’stvo Moskovskogo universiteta.
Kibrik, A. E. 1977a. Opyt strukturnogo opisanija arčinskogo jazyka, II: Taksonomičeskaja grammatika. (Publikacii otdelenija strukturnoj i prikladnoj lingvistiki, 12). Moscow: Izdatel’stvo Moskovskogo universiteta.
Kibrik, A. E. 1977b. Opyt strukturnogo opisanija arčinskogo jazyka, III: Dinamičeskaja grammatika. (Publikacii otdelenija strukturnoj i prikladnoj lingvistiki, 13). Moscow: Izdatel’stvo Moskovskogo universiteta.
Maurer, H. 1966. Grammatik der Tangga–Sprache (Melanesien). (Micro–Bibliotheca Anthropos 40). Bonn: Anthropos Institut.
Zaliznjak, A. A. 1973. O ponimanii termina ‘padež’ v lingvističeskix opisanijax. In: A. A. Zaliznjak (ed.) Problemy grammatičeskogo modelirovanija, 53–87. Moscow: Nauka.