The GOLD Effort So Far 1 

    
Terry Langendoen, Brian Fitzsimons, Emily Kidder (University of Arizona)

     Abstract

This paper describes the motivation for creating the General Ontology for Linguistic Description (GOLD) within the E-MELD project, namely as the core of a solution to the problem of rendering all electronically encoded linguistic data comparable. It summarizes the history of the development of GOLD, describes a specific design problem that has arisen, and proposes a solution.


     The goal that led to GOLD

The creation and development of the General Ontology for Linguistic Description (GOLD) grew out of the E-MELD project's attempt to solve a problem related to the one posed by Doug Whalen in his closing remarks at the 2001 workshop in Santa Barbara that launched the project.

Whalen's Problem. We want to be able to describe the data in just the way we want, but we don't want to program it. (http://linguist.emich.edu/~workshop/reports/whalen_workshop.html)

Most linguists are already capable of describing in text and diagram the data the way they want. The need to program the description arises when we want machines to understand it so that they can carry out additional tasks based on those descriptions that their human users cannot or will not perform.

Whalen's problem can be solved by forming teams of linguists and programmers to write the programs and interfaces necessary to enable the rest of us to prepare machine-readable descriptions to our specifications, and that's already happening to a considerable extent. Whalen's problem is fully solved when we've figured out how to program all reasonable linguistic descriptions and make the data entry tools usable for the ordinary working linguist. But even if Whalen's problem is solved, another equally difficult problem remains, namely: how do we deal with all the machine-readable descriptions that linguists around the world make available? In other words, here's ...

Our problem. We want to be able to describe the data in just the way we want, and we want to be able to use everybody else's data described in just the way they want, and we want to be able to process it in all kinds of ways that make sense to us as scientists and teachers.

Our problem is known as the interoperability problem.

The Text Encoding Initiative's "data interchange" solution to the interoperability problem

We're not the first project to try to come to grips with the interoperability problem, and not even the first linguistics project to do so. The Text Encoding Initiative (TEI), which was launched in November 1987, and is still active, proposed a solution that called for a "data interchange" format. The basic idea is this. Suppose there are two projects X and Y that describe some bits of language data using their own non-interoperable coding schemes. How do they share their data? They agree on an interchange format, call it P3. Data in X is transformed to P3 using some mapping Ψ, and sent to Y. Data in Y is transformed to P3 using some mapping Φ, and sent to X. When X receives Y's data, it performs the inverse of Φ; and when Y receives X's data it performs the inverse of Ψ.

You can see that this is not an ideal solution. A great deal of effort has to go into building, maintaining and updating P3. Moreover, each project that wants to share data with others has to work out mapping rules from its local coding scheme to P3 and from P3 back to its local coding scheme. So what's the point of having all these local coding schemes and mapping rules? Why not simply have everybody use P3? That's pretty much what has happened with TEI. Its interchange format (which at one time was called P3) contains highly detailed specifications for the encoding of all kinds of textual information, including linguistic descriptions. However, very few linguists use it. So even if the interoperability problem for linguistics has been solved in principle, it hasn't been solved in practice. To solve it in practice, we have to make the solution practical.

Some lessons from the TEI

Although TEI's solution to the interoperability problem is impractical, they have gotten a number of things right that we would be wise to adopt in our effort to solve it. The first is to recommend an open-source markup language that is expressively rich enough for everyone to use to describe their data. The TEI now uses XML; they started out with something called SGML, which was the only game in town in 1987. E-MELD agreed at the outset in 2001 to use XML. The second is to recognize that individual projects don't have to use XML locally, as long as their software exports to XML.2  XML is fully capable of expressing the content of relational databases, but is not particularly well suited for data entry into a database format. (Maybe that will change soon.) Data exportation carries out mapping rules like those described above, and there is no real need for inverse mappings, i.e. importation from XML.
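
As a minimal illustration of what such an export might look like, a single row of a hypothetical lexical database could be rendered in XML roughly as follows. The element names and content are invented for this sketch; they are not a recommended or project-endorsed schema.

    <!-- Hypothetical sketch only: one row of a lexical database exported to XML.
         The element names and content are invented, not a recommended schema. -->
    <entry id="e042">
      <headword>kupa</headword>
      <gloss xml:lang="en">water</gloss>
      <partOfSpeech>noun</partOfSpeech>
    </entry>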

XML markup is syntax

XML provides a way of specifying the syntactic structure of markup. So for example, the TEI recommends the use of tags to describe pieces of text as "sentences", "words" and "morphemes", and identifies them as <s>, <w> and <m> respectively. But nothing in the specification requires that they be interpreted as those things. They can be used to describe any three-level hierarchical arrangement of character strings, including:

<s> = sentence, <w> = word, <m> = morpheme
<s> = paragraph, <w> = sentence, <m> = word
<s> = chapter, <w> = paragraph, <m> = morpheme
<s> = big chunk, <w> = middle-size chunk, <m> = small chunk
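
For instance, a minimal fragment of such markup (the text is invented) might look like this; the nesting is all that the tags guarantee:

    <!-- Invented illustration: the tags enforce a three-level nesting, but
         nothing tells a machine that <w> means "word" rather than, say,
         "sentence" or "middle-size chunk". -->
    <s>
      <w><m>dog</m><m>s</m></w>
      <w><m>bark</m></w>
    </s>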

So how can we provide a semantics for XML syntax, so that a machine would know that, for example, the tag <w> means more than "whatever is contained in <s> and contains <m>"?

Two avenues to markup semantics

One way to achieve a semantics for linguistic markup suitable for machine processing is to follow TEI's lead by proposing a single set of markup standards, conventions and definitions which tie every piece of markup (tag, attribute, and designated content) to a specific meaning, and which seek to associate every concept in linguistic analysis with some piece of markup. Let us call this approach "The Syntax is the Semantics" (SIS). It can be thought of as the "brute force" solution to the interoperability problem.

The other is to leave the XML markup syntax that linguists use uninterpreted (except insofar as the syntax can be interpreted internally, as in the TEI text-chunking example above) and to provide external but universally accessible (i.e. open-source Web) resources that encoders can use to interpret their markup further, and — here lies the key to the interoperability problem — to do so consistent with everyone else's interpretations. Let us call this approach "Leave the Semantics to Us" (LSU). We might also label this the "Semantic Web" approach, as that is the name given to this approach in other areas.

Problems with SIS

Without the World Wide Web and its resources available to us, SIS is the only viable approach to the interoperability problem, so it's not surprising that TEI, whose basic design was in place before the Web had become widely known and used, came up with it. The main problem with it, assuming that we have the resources to build a huge XML schema for linguistic description, is, as we have already observed, inducing linguists to use that schema. In addition, it would be very costly to retrofit existing resources to conform to that schema, and future changes to the schema would be very likely to create serious instances of downward incompatibility.

Advantages of LSU

One reason linguists would look with disfavor on a SIS solution is that for their own purposes, they would only need small pieces of the solution.3  LSU on the other hand lends itself to the creation of a variety of special-purpose markup schemes that are made available, say, in a linguist's toolkit and "tuned" to particular descriptive needs. Schemes that are developed for widely-shared needs could be refined over time and recommended as "best practice". However, resources for interpreting non-best-practice markup schemes could still be provided.

In addition, converting non-XML-encoded "legacy" and database resources to XML and subsequently to best practice is more easily carried out under LSU, since the target XML schemas are simpler, and in the case of exporting data in databases, the targets can be made to model the database structures. Finally, changes to the external semantics are less likely to break local applications, though alas they can, as we point out below.

Place of a linguistic ontology in an interoperable LSU solution

The central component of an LSU solution to the interoperability problem is a linguistic ontology that:

  1. defines the common concepts used in linguistic analysis and description,
  2. expresses the relations that hold among those concepts,
  3. relates those concepts to concepts of common-sense understanding ("upper" ontology) and concepts in other disciplines.

Since the definitions and relations in an ontology need to be machine interpretable, it must be written in a formal language that machines can process. And since it needs to interact with XML files, it should be built "on top of" XML. Such a language is OWL, the Web Ontology Language developed within the World Wide Web Consortium, which is built on top of RDF, the Resource Description Framework, which in turn is built on top of XML. Moreover, OWL comes in a number of "flavors" that support different powers of reasoning, i.e. drawing conclusions from premises. The most widely used variety is OWL-DL, DL standing for "description logic".


     Creating and developing GOLD within E-MELD

At the Santa Barbara workshop that kicked off the E-MELD project in June 2001, Terry described the problems with TEI's SIS approach, but didn't recommend an alternative. The Markup Working Group that he chaired at that workshop recommended the use of XML as the markup language of choice for E-MELD and the creation of lists of concepts to be used in markup, together with alternative names with an eye toward at least achieving terminological interoperability. Shortly after the workshop ended, Scott convinced Will and Terry that solving the interoperability problem in its entirety should be our number one goal, and that the Arizona E-MELD team (E-MELD AZ) should therefore immediately begin work on building a linguistic ontology. Terry got the green light from E-MELD central command following a presentation of these ideas at Eastern Michigan University in August 2001. E-MELD AZ first publicly declared its intention to build a linguistic ontology at the OLAC workshop in Philadelphia in December 2001.4  

GOLD gets its name

In presentations at workshops at the LREC conference in May 2002 and the AAAI conference in July 2002,5 ,6  E-MELD AZ proposed to build a linguistic ontology as a component of a Semantic Web application that would render linguistic resources on the Web accessible to "smart searching". They didn't agree on the name General Ontology for Linguistic Description (GOLD) until after the AAAI workshop. Terry first used the name in presentations at the University of Utrecht and the Max Planck Institute for Psycholinguistics in Nijmegen in September 2002, and it appeared in print for the first time the following year.7 

First major tests of GOLD in an LSU environment

Last year, E-MELD AZ, together with Gary, Scott, and Will's NSF SGER team at CSU Fresno, showed that GOLD could be used for smart searching across massive cross-linguistic databases created from XML documents containing interlinear text8  and XML-encoded lexicons.9  We considered these tests to constitute "proof of concept" of the LSU solution to the interoperability problem for linguistic data.

The GOLD summit

Last November, Will hosted a summit meeting of researchers most involved with GOLD to plan for its further development and maintenance after Arizona's E-MELD funding has run out, which is today! It recommended, among other things, the following.

  1. Creating a GOLD website, which Baden Hughes has taken care of, at www.linguistics-ontology.org
  2. Forming a GOLD Council with oversight responsibility, and putting procedures in place using the OLAC model to foster and evaluate development and maintenance.
  3. Focusing the E-MELD 2005 workshop on GOLD.


     Current state of play

We're proposing to move GOLD "out of the lab" effective with this meeting despite the fact that:

  1. GOLD version 0.2 has very small coverage, with most areas of the field not covered at all;
  2. we have not settled on an upper ontology to connect to (currently SUMO);
  3. some "core GOLD" concepts are in flux;
  4. E-MELD AZ (that's us, now) broke last year's proof of concept tests for interoperability with our redesign of the treatment of grammatical features.

Let us say more about our reasons for this redesign.

Use of classes and instances in GOLD 0.1 ("Old GOLD")

Last year's applications were intended as proofs of concept, and had relatively limited goals: querying similar, loosely structured sets of data while demonstrating some ability to reason. To do this, we used off-the-shelf open-source components, and were limited to their capabilities.

In RDF/OWL, there are two direct ways of relating pieces of information: class hierarchies and properties. Class hierarchies are the easiest to use for reasoning: if a particular instance i is of type A and A is a subclass of B, then the system can conclude that i is also of type B. For example, since TransitiveVerb and IntransitiveVerb are both subclasses of Verb, a search for all instances of Verb will find all instances of TransitiveVerb and IntransitiveVerb. Similarly, if PresentTense and PastTense are subclasses of NonFutureTense, a search for all instances of NonFutureTense will find all instances of PresentTense and PastTense.
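
As a sketch of how this looks in RDF/XML (the namespace URI and the instance name "kick-1" are assumptions made for this illustration, not part of GOLD itself):

    <!-- Sketch of subclass-based reasoning in RDF/XML. The gold: namespace URI
         and the instance "kick-1" are assumed for illustration only. -->
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
             xmlns:owl="http://www.w3.org/2002/07/owl#"
             xmlns:gold="http://www.linguistics-ontology.org/gold#">

      <owl:Class rdf:about="http://www.linguistics-ontology.org/gold#Verb"/>
      <owl:Class rdf:about="http://www.linguistics-ontology.org/gold#TransitiveVerb">
        <rdfs:subClassOf rdf:resource="http://www.linguistics-ontology.org/gold#Verb"/>
      </owl:Class>

      <!-- "kick-1" is typed only as a TransitiveVerb; a reasoner concludes that
           it is also an instance of Verb, so a query for Verb retrieves it. -->
      <gold:TransitiveVerb rdf:about="http://example.org/language-x#kick-1"/>
    </rdf:RDF>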

However, we also need to express how concepts in specific languages relate to each other, such as, "In language X, verbs are inflected only for tense". OWL-DL has some more sophisticated ways of dealing with information like that, but with an RDF reasoner, we were limited to statements like: "Verb inflectedFor Tense". But there are two problems with this.

  1. Classes cannot serve as subjects or objects of statements, except with specific properties, such as subclassOf.
  2. It does not correctly encode the fact that verbs are inflected only for tense in that particular language.

To overcome these limitations, we made all concepts in GOLD 0.1 into classes, whose instances are language-specific concepts. So the above statement becomes: "XVerb inflectedFor XTense," where XVerb is an instance of Verb and XTense is an instance of Tense. The fact that in X verbs are inflected only for tense would be discovered by asking for all the things that XVerb is inflected for.
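
A sketch of this GOLD 0.1 style in RDF/XML (the x: namespace and the names XVerb and XTense stand in for the language-specific concepts; inflectedFor is the property just mentioned):

    <!-- Sketch of GOLD 0.1 modeling: language-specific concepts as instances
         of GOLD classes. The x: namespace and names are illustrative. -->
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:gold="http://www.linguistics-ontology.org/gold#">

      <gold:Verb rdf:about="http://example.org/language-x#XVerb">
        <!-- Both subject and object are instances, so an ordinary property
             such as inflectedFor can relate them directly. -->
        <gold:inflectedFor rdf:resource="http://example.org/language-x#XTense"/>
      </gold:Verb>

      <gold:Tense rdf:about="http://example.org/language-x#XTense"/>
    </rdf:RDF>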

This approach worked well for last year's application because it is relatively simple and tools are optimized to reason quite quickly over the type and subclassOf properties. However, it is also limiting. Since classes cannot be related by arbitrary properties and every concept in GOLD is a class, it is difficult to articulate the relations between GOLD concepts, in particular the relation between grammatical features and their values. While it is easy to specify that XTense has the values XFutureTense and XNonFutureTense using a property like hasValue, it is not so easy to specify that Tense itself (i.e. the non-language-specific concept) has the possible values FutureTense and NonFutureTense, since both the subject and object of the hasValue property would be classes, and that's not allowed.10 

Changes in GOLD 0.2

In developing GOLD 0.2, we relaxed the requirements that GOLD concepts be restricted to classes and that language-specific concepts be restricted to instances of those classes as follows.

  1. Allow certain GOLD concepts to be instances of other GOLD classes. In particular, define atomic feature values as instances of particular feature classes.
  2. Allow certain language-specific concepts to be classes that are instantiated by other language-specific concepts. In particular, define language-specific features as classes instantiated by their language-specific values.

For example in GOLD 0.2, TenseFeature has 29 instances (values), as shown in Fig. 16 in the print version included in the workshop notebook. They are related by an "entails" property that is defined as (logically) transitive, and as having an inverse, called "entailedBy" (not shown in the print version, but included in the html version on www.linguistics-ontology.org). These are legitimate OWL-DL constructs, as they are instance-to-instance relations.
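
A sketch in RDF/XML of the relevant declarations (URIs are assumed; only two of the 29 values are shown, and the particular entailment chosen is illustrative):

    <!-- Sketch of GOLD 0.2 modeling: tense values as instances of the class
         TenseFeature, related by a transitive "entails" property with inverse
         "entailedBy". Only two values are shown; the entailment shown here
         (PastTense entails NonFutureTense) is illustrative. -->
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:owl="http://www.w3.org/2002/07/owl#"
             xmlns:gold="http://www.linguistics-ontology.org/gold#">

      <owl:Class rdf:about="http://www.linguistics-ontology.org/gold#TenseFeature"/>

      <owl:TransitiveProperty rdf:about="http://www.linguistics-ontology.org/gold#entails">
        <owl:inverseOf rdf:resource="http://www.linguistics-ontology.org/gold#entailedBy"/>
      </owl:TransitiveProperty>
      <owl:ObjectProperty rdf:about="http://www.linguistics-ontology.org/gold#entailedBy"/>

      <!-- An instance-to-instance relation, which is legitimate in OWL-DL. -->
      <gold:TenseFeature rdf:about="http://www.linguistics-ontology.org/gold#PastTense">
        <gold:entails rdf:resource="http://www.linguistics-ontology.org/gold#NonFutureTense"/>
      </gold:TenseFeature>
      <gold:TenseFeature rdf:about="http://www.linguistics-ontology.org/gold#NonFutureTense"/>
    </rdf:RDF>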

Since TenseFeature in GOLD is a class, it can have GOLD-defined subclasses, representing various feature systems, i.e. classes of values also related by the entails and entailedBy properties. We didn't define any Tense feature systems in GOLD; we only attempted a few (for Number and Person) for demonstration purposes. It's not difficult to come up with Tense systems that are widely attested in the world's languages, for example {AnyTense, NonPastTense, HodiernalPastTense, PreHodiernalPastTense}. Let's call this system TenseSystem-x, which we can diagram as in Figure 1 (making liberal use of shorthand), the arcs representing entails (upward) and entailedBy (downward). TenseSystem-x is a subclass of TenseFeature, preserving the entails and entailedBy properties of its instances; that is, it is also a substructure of TenseFeature, where the properties in question provide the structure.


Figure 1 TenseSystem-x as a substructure of TenseFeature
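
In the same RDF/XML style, TenseSystem-x could be sketched as a subclass of TenseFeature whose instances are the four values; only one of the entails arcs is shown here, the remaining values and arcs following the pattern of Figure 1 (URIs assumed):

    <!-- Sketch of TenseSystem-x (URIs assumed). Each value is an instance of
         the system class; the remaining values are declared analogously. -->
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
             xmlns:owl="http://www.w3.org/2002/07/owl#"
             xmlns:gold="http://www.linguistics-ontology.org/gold#">

      <owl:Class rdf:about="http://www.linguistics-ontology.org/gold#TenseSystem-x">
        <rdfs:subClassOf rdf:resource="http://www.linguistics-ontology.org/gold#TenseFeature"/>
      </owl:Class>

      <gold:TenseSystem-x rdf:about="http://www.linguistics-ontology.org/gold#HodiernalPastTense">
        <gold:entails rdf:resource="http://www.linguistics-ontology.org/gold#AnyTense"/>
      </gold:TenseSystem-x>
      <gold:TenseSystem-x rdf:about="http://www.linguistics-ontology.org/gold#AnyTense"/>
    </rdf:RDF>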

Now suppose we have a language X whose Tense system is isomorphic to TenseSystem-x; that is it has four instances (values) that map one to one to the values in TenseSystem-x in GOLD as shown in Figure 2, and that have the same properties (entails and entailedBy) as shown in Figure 3. Call that system XTense, which can now be represented as a class concept in a profile or COPE for that language.


Figure 2 Mapping to GOLD TenseSystem-x from XTense


Figure 3 XTense system isomorphic to TenseSystem-x

By mapping XPres(ent) to GOLD NonP(ast), XRec(ent)P(ast) to GOLD Hod(iernal)P(ast), and XRem(ote)P(ast) to GOLD PreHod(iernal)P(ast), the encoder is claiming that XPresent relates most closely to GOLD NonPast, not to its GOLD namesake Present, etc. (i.e. it's called "Present" in this description of X, even though it means "NonPast", etc.). In this way, we not only have a term mapping, but a structure mapping.11 

One of the clear advantages of the approach we're proposing is that it provides a straightforward way of doing structural mapping using profiles and COPEs. We set up GOLD structures (classes with structural relations such as "entails", though other relations such as "constituentOf" could also serve, defined over their instances), and provide ways of mapping structures defined for specific languages to those GOLD structures, as sketched below.
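
A sketch of how such a mapping might be encoded in a profile or COPE for language X. The x: namespace and the property mapsToGold are hypothetical stand-ins, since the exact mapping vocabulary is not fixed here; the value-to-value correspondences are those of Figure 2.

    <!-- Hypothetical sketch of a language-X profile (COPE). The x: namespace
         and the mapsToGold property are invented stand-ins for whatever
         mapping vocabulary the profile actually uses. -->
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:owl="http://www.w3.org/2002/07/owl#"
             xmlns:x="http://example.org/language-x#">

      <!-- XTense is a language-specific class instantiated by its values. -->
      <owl:Class rdf:about="http://example.org/language-x#XTense"/>

      <x:XTense rdf:about="http://example.org/language-x#XPresent">
        <!-- The encoder maps "Present" in this description of X to GOLD
             NonPastTense, not to its GOLD namesake Present. -->
        <x:mapsToGold rdf:resource="http://www.linguistics-ontology.org/gold#NonPastTense"/>
      </x:XTense>
      <x:XTense rdf:about="http://example.org/language-x#XRecentPast">
        <x:mapsToGold rdf:resource="http://www.linguistics-ontology.org/gold#HodiernalPastTense"/>
      </x:XTense>
      <x:XTense rdf:about="http://example.org/language-x#XRemotePast">
        <x:mapsToGold rdf:resource="http://www.linguistics-ontology.org/gold#PreHodiernalPastTense"/>
      </x:XTense>
    </rdf:RDF>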

Creating Feature and Feature Class Files

The task of compiling feature values and classes to populate the GOLD ontology involved a number of steps. Group members performed a literature review on various morpho-syntactic topics, including modality, voice, aspect, tense, person, number, polarity, case, evidentiality, gender, mood, evaluatives and force.

Beforehand, we determined what kinds of information would be collected for each term, and for consistency created an Excel template for this information. The following pieces of data were recorded:

  1. Term name - Primary name for term.
  2. Alternate name - Any other names cited in the source as identical in meaning to the recorded term.
  3. Definition - Semantic function of the term. In some cases the definition was taken verbatim from the source; in others it was paraphrased as necessary. Full bibliographic information on the reference for the definition was also included.
  4. Example - For each term, an example that represented the concept was sought. The interlinear glossed text of the example was encoded into XML for later incorporation. When possible, examples were collected from endangered languages. References were included as for definitions.
  5. Ancestor in GOLD - The group that each term belonged to was recorded if possible for later use in determining the overarching hierarchical structure. For example, the term MiddleVoice was the Ancestor in GOLD to ReflexiveMiddleVoice.
  6. Comments - Specific comments for each term were included when relevant, such as when a term was language specific, or if there was a question or controversy about the definition.

Creation of Hierarchical Structure in GOLD

One of the main challenges of importing these terms was determining what kind of structure exists between the core concepts in GOLD, as well as what internal structure exists in categories such as Case or Number.

For each category, we went through the terms that were collected and decided which were the most relevant to a certain class or category. We then attempted to break these terms up into salient groupings with the help of literature dealing with the topic. In order to do this, it was necessary to discuss what kinds of domains were dealt with by each class, and how the individual terms related to each other in those domains. A relatively simple case is tense, which deals with placement along a timeline; the terms that were grouped together dealt with similar placements relative to the present moment. These groupings of terms were then placed relative to each other in the overall hierarchy, in an attempt to accommodate the attested linguistic usages seen in the literature and in the examples that were collected, while leaving room for other possible usages within the domain.


     The Future of GOLD

Since GOLD was created as part of the E-MELD project, control of its development was limited to the project's active participants. We now propose to pass control to a body of overseers responsible to the GOLD Community, consisting of everyone who has an intellectual or practical stake in its success. Many of us who have been involved in the work so far are eager to continue, but we wish to make it clear that we are now simply partners with you in its future development.


1. We thank everyone else who's worked on E-MELD at U Arizona between 2001 and 2005, especially Scott Farrar, Will Lewis, Ruby Basham, Peter Norquest, Shauna Eggers, Alexis Lanham, Jesse Kirchner, and Sandy Chow. We also thank everyone who's worked on E-MELD elsewhere, especially Anthony Aristar, Helen Aristar-Dry, Laura Buszard-Welcher, Zhenwei Chen, Jeff Good, Baden Hughes, Gary Simons, and Doug Whalen. Return to 1
2. More needs to be said about this. There are smarter and dumber ways to export to XML, and the results of dumb exportation can be pretty painful. Return to 2
3. TEI has attempted to overcome this problem by modularizing its SIS schema (the "pizza" model), so that encoders can select from a menu of "toppings" to a base schema. Return to 3
4. Lewis, W.D., S. Farrar & D.T. Langendoen (2001) Building a Knowledge Base of Morphosyntactic Terminology. Proceedings of the IRCS Workshop on Linguistic Databases, December 2001, pp. 150-156. Philadelphia: Institute for Research in Cognitive Science, University of Pennsylvania. <http://emeld.org/documents/IRCS-BuildingKnowledgeBase.pdf> Return to 4
5. Farrar, S., D.T. Langendoen & W.D. Lewis (2002) Bridging the Markup Gap: Smart Search Engines for Language Researchers. Proceedings of the LREC Workshop on Resources and Tools for Field Linguistics, Las Palmas, May 2002. <http://emeld.org/documents/LREC-BridgingMarkupGap.pdf> Return to 5
6. Farrar, S., W.D. Lewis & D.T. Langendoen (2002) An Ontology for Linguistic Annotation. Semantic Web Meets Language Resources: Papers from the AAAI Workshop, Edmonton, July 2002. Technical Report WS-02-16, pp. 11-19. Menlo Park, CA: AAAI Press. <http://emeld.org/documents/AAAI-OntologyLinguisticAnnotation.pdf> Return to 6
7. Farrar, S. & D.T. Langendoen (2003) A Linguistic Ontology for the Semantic Web. GLOT International 7(3).97-100. <http://emeld.org/documents/GLOT-LinguisticOntology.pdf> Return to 7
8. Simons, G.F., B. Fitzsimons, D.T. Langendoen, W.D. Lewis, S. Farrar, A. Lanham, R. Basham & X. Gonzalez (2004) A Model for Interoperability: XML Documents as an RDF Database. Proceedings of the 2004 E-MELD Workshop, Detroit, July 2004. <http://emeld.org/workshop/2004/langendoen-paper.html> Return to 8
9. Simons, G.F., W.D. Lewis, S. Farrar, D.T. Langendoen, B. Fitzsimons & X. Gonzalez (2004) The semantics of markup: Mapping legacy markup schemas to a common semantics. Proceedings of the XMLNLP Workshop, Barcelona, July 2004. <http://emeld.org/documents/SOMFinal1col.pdf> Return to 9
10. One solution would be to make feature values subclasses of their corresponding features, but this would effectively obliterate the conceptual distinction between features and values. For example, the relation between Tense and PastTense would be the same as the relation between PastTense and HodiernalPastTense, namely the second is a subclass of the first. Return to 10
11. Although we described the mapping between TenseSystem-x and XTense as an isomorphism, in general the mapping from language feature values to GOLD feature values is a function without a functional inverse. For example, while each language value should map uniquely to a GOLD value, the converse is not true. A single GOLD value may map to more than one language value. Note also that two distinct language values may map to the same GOLD value. Return to 11