The GOLD Effort So Far 1
This paper describes the motivation for creating the General Ontology for Linguistic Description (GOLD) within the E-MELD project, namely as the core of a solution to the problem of rendering all electronically encoded linguistic data comparable. It summarizes the history of the development of GOLD, describes a specific design problem that has arisen, and proposes a solution.
The creation and development of the General Ontology for Linguistic Description (GOLD) grew out of the E-MELD project's attempt to solve a problem related to the one posed by Doug Whalen in his closing remarks at the 2001 workshop in Santa Barbara that launched the project.
Most linguists are already capable of describing their data in text and diagrams the way they want. The need to program the descriptions arises when we want machines to understand them, so that the machines can carry out additional tasks based on those descriptions that their human users cannot or will not perform. Whalen's problem can be solved by forming teams of linguists and programmers to write the programs and interfaces necessary to enable the rest of us to prepare machine-readable descriptions to our specifications, and that's already happening to a considerable extent. Whalen's problem will be fully solved when we've figured out how to program all reasonable linguistic descriptions and make the data entry tools usable for the ordinary working linguist. But even if Whalen's problem is solved, another equally difficult problem remains, namely, how do we deal with all the machine-readable descriptions that linguists around the world make available? In other words here's ...
Our problem is known as the interoperability problem.
The Text Encoding Initiative's "data interchange" solution to the interoperability problem
We're not the first project to try to come to grips with the interoperability problem, and not even the first linguistics project to do so. The Text Encoding Initiative (TEI), which was launched in November 1987 and is still active, proposed a solution that called for a "data interchange" format. The basic idea is this. Suppose there are two projects X and Y that describe some bits of language data using their own non-interoperable coding schemes. How do they share their data? They agree on an interchange format, call it P3. Data in X is transformed to P3 using some mapping Ψ, and sent to Y. Data in Y is transformed to P3 using some mapping Φ, and sent to X. When X receives Y's data, it performs the inverse of Φ; and when Y receives X's data it performs the inverse of Ψ. You can see that this is not an ideal solution. A great deal of effort has to go into building, maintaining and updating P3. Moreover, each project that wants to share data with others has to work out mapping rules to P3 from its local coding scheme and from P3 to its local coding scheme. So what's the point of having all these local coding schemes and mapping rules? Why not simply have everybody use P3? That's pretty much what has happened with TEI. Its interchange format (which at one time was called P3) contains highly detailed specifications for the encoding of all kinds of textual information, including linguistic descriptions. However, very few linguists use it. So even if the interoperability problem for linguistics has been solved in principle, it hasn't been solved in practice. To solve it in practice, we have to make the solution practical.
Some lessons from the TEI
Although TEI's solution to the interoperability problem is impractical, they have gotten a number of things right that we would be wise to adopt in our effort to solve it. The first is to recommend an open-source markup language that is expressively rich enough for everyone to use to describe their data. The TEI now uses XML; they started out with something called SGML, which was the only game in town in 1987. E-MELD agreed at the outset in 2001 to use XML. The second is to recognize that individual projects don't have to use XML locally, as long as their software exports to XML.2 XML is fully capable of expressing the content of relational databases, but is not particularly well suited for data entry into a database format. (Maybe that will change soon.) Data exportation carries out mapping rules like those described above, and there is no real need for inverse mappings, i.e. importation from XML.
XML markup is syntax
XML provides a way of specifying the syntactic structure of markup. So for example, the TEI recommends the use of tags to describe pieces of text as "sentences", "words" and "morphemes", and identifies them using <s>, <w> and <m> respectively. But nothing in the specification requires that they be interpreted as those things. They can be used to describe any three-level hierarchical arrangement of character strings, including:
So how can we provide a semantics for XML syntax, so that a machine would know that, for example, the tag <w> means more than "whatever is contained in <s> and contains <m>"?
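The point that the tags constrain only syntax can be made concrete with a small sketch. It is written in Python with the standard-library XML parser; the two sample documents and the helper names `shape` and `is_three_level` are our own inventions for illustration, not part of the TEI or of E-MELD. Both documents pass the same structural check, although only the first uses the tags in the sense the TEI intends.

```python
import xml.etree.ElementTree as ET

# Two documents with identical <s>/<w>/<m> nesting. Both are made-up
# examples: the first holds morphemes, the second holds the parts of a date.
LINGUISTIC = "<s><w><m>un</m><m>lock</m><m>ed</m></w><w><m>door</m><m>s</m></w></s>"
NON_LINGUISTIC = "<s><w><m>2005</m><m>07</m></w><w><m>01</m></w></s>"


def shape(xml_text):
    """Return the content as a list (one entry per <w>) of lists of <m> strings."""
    root = ET.fromstring(xml_text)
    return [[m.text for m in w.findall("m")] for w in root.findall("w")]


def is_three_level(xml_text):
    """Check only the syntactic constraint: <s> contains <w>, which contains <m>."""
    root = ET.fromstring(xml_text)
    if root.tag != "s":
        return False
    words = list(root)
    if not words or any(w.tag != "w" for w in words):
        return False
    return all(len(w) > 0 and all(m.tag == "m" for m in w) for w in words)


if __name__ == "__main__":
    for doc in (LINGUISTIC, NON_LINGUISTIC):
        print(is_three_level(doc), shape(doc))
```

Nothing in the parser or in the structural check distinguishes the two documents; whatever makes `<w>` mean "word" has to come from somewhere other than the markup syntax.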
Two avenues to markup semantics
One way to achieve a semantics for linguistic markup suitable for machine processing is to follow TEI's lead by proposing a single set of markup standards, conventions and definitions which tie every piece of markup (tag, attribute, and designated content) to a specific meaning, and which seek to associate every concept in linguistic analysis with some piece of markup. Let us call this approach "The Syntax is the Semantics" (SIS). It can be thought of as the "brute force" solution to the interoperability problem. The other is to leave the XML markup syntax that linguists use uninterpreted (except insofar as the syntax can be interpreted internally, as in the TEI text-chunking example above) and to provide external but universally accessible (i.e. open-source Web) resources that encoders can use to interpret their markup further, and — here lies the key to the interoperability problem — to do so consistent with everyone else's interpretations. Let us call this approach "Leave the Semantics to Us" (LSU). We might also label this the "Semantic Web" approach, as that is the name given to this approach in other areas.
Problems with SIS
Without the World Wide Web and its resources available to us, SIS is the only viable approach to the interoperability problem, so it's not surprising that TEI, whose basic design was in place before the Web had become widely known and used, came up with it. The main problem with it, assuming that we have the resources to build a huge XML schema for linguistic description, is, as we have already observed, inducing linguists to use that schema. In addition, it would be very costly to retrofit existing resources to conform to that schema, and future changes to the schema would be very likely to create serious instances of downward incompatibility.
Advantages of LSU
One reason linguists would look with disfavor on a SIS solution is that for their own purposes, they would only need small pieces of the solution.3 LSU on the other hand lends itself to the creation of a variety of special-purpose markup schemes that are made available, say, in a linguist's toolkit and "tuned" to particular descriptive needs. Schemes that are developed for widely-shared needs could be refined over time and recommended as "best practice". However, resources for interpreting non-best practice markup schemes could still be provided. In addition, converting non-XML-encoded "legacy" and database resources to XML and subsequently to best practice is more easily carried out under LSU, since the target XML schemas are simpler, and in the case of exporting data in databases, the targets can be made to model the database structures. Finally, changes to the external semantics are less likely to break local applications, though alas they can, as we point out below.
Place of a linguistic ontology in an interoperable LSU solution
The central component of an LSU solution to the interoperability problem is a linguistic ontology that:
Since the definitions and relations in an ontology need to be machine interpretable, it must be written in a programming language. And since it needs to interact with XML files, it should be built "on top of" XML. Such a language is OWL, the Web Ontology Language under construction within the World Wide Web Consortium, which is built on top of RDF, the Resource Description Framework, which in turn is built on top of XML. Moreover, OWL comes in a number of "flavors" that support different powers of reasoning, i.e. drawing conclusions from premises. The most widely used variety is OWL-DL, DL for "description logic".
At the Santa Barbara workshop that kicked off the E-MELD project in June 2001, Terry described the problems with TEI's SIS approach, but didn't recommend an alternative. The Markup Working Group that he chaired at that workshop recommended the use of XML as the markup language of choice for E-MELD and the creation of lists of concepts to be used in markup, together with alternative names with an eye toward at least achieving terminological interoperability. Shortly after the workshop ended, Scott convinced Will and Terry that solving the interoperability problem in its entirety should be our number one goal, and that the Arizona E-MELD team (E-MELD AZ) should therefore immediately begin work on building a linguistic ontology. Terry got the green light from E-MELD central command following a presentation of these ideas at Eastern Michigan University in August 2001. E-MELD AZ first publicly declared its intention to build a linguistic ontology at the OLAC workshop in Philadelphia in December 2001.4
GOLD gets its name
In presentations at workshops at the LREC conference in May 2002 and the AAAI conference in July 2002,5,6 E-MELD AZ proposed to build a linguistic ontology as a component of a Semantic Web application that would render linguistic resources on the Web accessible to "smart searching". They didn't agree on the name General Ontology for Linguistic Description (GOLD) until after the AAAI workshop. Terry first used the name in presentations at the University of Utrecht and the Max Planck Institute for Psycholinguistics in Nijmegen in September 2002, and it appeared in print for the first time the following year.7
First major tests of GOLD in an LSU environment
Last year, E-MELD AZ, together with Gary, Scott, and Will's NSF SGER team at CSU Fresno, showed that GOLD could be used for smart searching across massive cross-linguistic databases created from XML documents containing interlinear text8 and XML-encoded lexicons.9 We considered these tests to constitute "proof of concept" of the LSU solution to the interoperability problem for linguistic data.
The GOLD summit
Last November, Will hosted a summit meeting of researchers most involved with GOLD to plan for its further development and maintenance after Arizona's E-MELD funding has run out, which is today! It recommended, among other things, the following.
We're proposing to move GOLD "out of the lab" effective with this meeting despite the fact that:
Let us say more about our reasons for this redesign.
Use of classes and instances in GOLD 0.1 ("Old GOLD")
Last year's applications were intended as proofs of concept, and had relatively limited goals: querying similar, loosely structured sets of data while demonstrating some ability to reason. To do this, we used off-the-shelf open-source components, and were limited to their capabilities. In RDF/OWL, there are two direct ways of relating pieces of information: class hierarchies and properties. Class hierarchies are the easiest to use for reasoning: if a particular instance i is of type A and A is a subclass of B, then the system can conclude that i is also of type B. For example, since TransitiveVerb and IntransitiveVerb are both subclasses of Verb, a search for all instances of Verb will find all instances of TransitiveVerb and IntransitiveVerb. Similarly, if PresentTense and PastTense are subclasses of NonFutureTense, a search for all instances of NonFutureTense will find all instances of PresentTense and PastTense.
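The subclass inference just described can be sketched in a few lines of Python. This is an illustration only, not the RDF reasoner we actually used: the subclass links are the ones named in the text, while the instance names (`item1` and so on) and the function names are hypothetical, and the sketch assumes each class has at most one direct superclass.

```python
# Subclass links named in the text (child -> parent).
SUBCLASS_OF = {
    "TransitiveVerb": "Verb",
    "IntransitiveVerb": "Verb",
    "PresentTense": "NonFutureTense",
    "PastTense": "NonFutureTense",
}

# Hypothetical instance data, invented for this illustration.
INSTANCE_OF = {
    "item1": "TransitiveVerb",
    "item2": "IntransitiveVerb",
    "item3": "PastTense",
}


def superclasses(cls):
    """Yield cls itself and every class above it in the hierarchy."""
    while cls is not None:
        yield cls
        cls = SUBCLASS_OF.get(cls)


def instances_of(cls):
    """Return all instances whose declared type is cls or a subclass of cls."""
    return sorted(i for i, c in INSTANCE_OF.items() if cls in superclasses(c))


if __name__ == "__main__":
    print(instances_of("Verb"))
    print(instances_of("NonFutureTense"))
```

A search for Verb returns both `item1` and `item2` even though neither is declared to be a Verb directly, which is the behavior the paragraph above relies on.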
However, we also need to express how concepts in specific languages relate to each other, such as, "In language X, verbs are inflected only for tense". OWL-DL has some more sophisticated ways of dealing with information like that, but with an RDF reasoner, we were limited to statements like: "Verb inflectedFor Tense". But this has two problems with it.
To overcome these limitations, we made all concepts in GOLD 0.1 into classes, whose instances are language-specific concepts. So the above statement becomes: "XVerb inflectedFor XTense," where XVerb is an instance of Verb and XTense is an instance of Tense. The fact that in X verbs are inflected only for tense would be discovered by asking for all the things that XVerb is inflected for. This approach worked well for last year's application because it is relatively simple and tools are optimized to reason quite quickly over the type and subclassOf properties. However, it is also limiting. Since classes cannot be related by arbitrary properties and every concept in GOLD is a class, it is difficult to articulate the relations between GOLD concepts, in particular the relation between grammatical features and their values. While it is easy to specify that XTense has the values XFutureTense and XNonFutureTense using a property like hasValue, it is not so easy to specify that Tense itself (i.e. the non-language-specific concept) has the possible values FutureTense and NonFutureTense, since both the subject and object of the hasValue property would be classes, and that's not allowed.10
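The GOLD 0.1 arrangement can be pictured as a small set of subject-property-object statements. The following Python sketch uses only the names that appear in the paragraph above; the tuple representation and the helper `objects` are our own simplification for illustration and do not reproduce the RDF store used in last year's application.

```python
# Statements about language X, using the names from the text.
# Language-specific concepts (XVerb, XTense, ...) are instances of GOLD classes.
TRIPLES = {
    ("XVerb", "type", "Verb"),
    ("XTense", "type", "Tense"),
    ("XVerb", "inflectedFor", "XTense"),
    ("XTense", "hasValue", "XFutureTense"),
    ("XTense", "hasValue", "XNonFutureTense"),
}


def objects(subject, predicate):
    """Return every object o such that (subject, predicate, o) is asserted."""
    return sorted(o for s, p, o in TRIPLES if s == subject and p == predicate)


if __name__ == "__main__":
    # Everything XVerb is inflected for: only XTense.
    print(objects("XVerb", "inflectedFor"))
    # The values of XTense in language X.
    print(objects("XTense", "hasValue"))
```

The limitation discussed above shows up as soon as one tries to add a statement such as ("Tense", "hasValue", "FutureTense"): both ends would be GOLD classes rather than instances, which is what the design did not permit.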
Changes in GOLD 0.2
In developing GOLD 0.2, we relaxed the requirements that GOLD concepts be restricted to classes and that language-specific concepts be restricted to instances of those classes as follows.
For example in GOLD 0.2, TenseFeature has 29 instances (values), as shown in Fig. 16 in the print version included in the workshop notebook. They are related by an "entails" property that is defined as (logically) transitive, and as having an inverse, called "entailedBy" (not shown in the print version, but included in the html version on www.linguistics-ontology.org). These are legitimate OWL-DL constructs, as they are instance-to-instance relations. Since TenseFeature in GOLD is a class, it can have GOLD-defined subclasses, representing various feature systems, i.e. classes of values also related by the entails and entailedBy properties. We didn't define any Tense feature systems in GOLD; we only attempted a few (for Number and Person) for demonstration purposes. It's not difficult to come up with Tense systems that are widely attested in the world's languages, for example {AnyTense, NonPastTense, HodiernalPastTense, PreHodiernalPastTense}. Let's call this system TenseSystem-x, which we can diagram as in Figure 1 (making liberal use of shorthand), the arcs representing entails (upward) and entailedBy (downward). TenseSystem-x is a subclass of TenseFeature, preserving the entails and entailedBy properties of its instances; that is, it is also a substructure of TenseFeature, where the properties in question provide the structure.
Figure 1 TenseSystem-x as a substructure of TenseFeature
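The effect of declaring "entails" transitive and giving it an inverse can be sketched as follows. Since the figure is not reproduced here, the particular arcs listed in `ENTAILS` are assumptions made for the sake of the example (in particular, the intermediate value PastTense and the arcs through it); only the four members of TenseSystem-x and the property names come from the text. The function names are likewise ours.

```python
# Assumed fragment of the "entails" arcs among TenseFeature values.
# These specific arcs are illustrative assumptions, not taken from GOLD.
ENTAILS = {
    ("HodiernalPastTense", "PastTense"),
    ("PreHodiernalPastTense", "PastTense"),
    ("PastTense", "AnyTense"),
    ("NonPastTense", "AnyTense"),
}

# The four values of TenseSystem-x, as listed in the text.
TENSE_SYSTEM_X = {
    "AnyTense", "NonPastTense", "HodiernalPastTense", "PreHodiernalPastTense",
}


def transitive_closure(pairs):
    """Add (a, c) whenever (a, b) and (b, c) are present, until nothing changes."""
    closure = set(pairs)
    while True:
        new = {(a, d) for (a, b) in closure for (c, d) in closure if b == c} - closure
        if not new:
            return closure
        closure |= new


def inverse(pairs):
    """entailedBy is entails with subject and object swapped."""
    return {(b, a) for (a, b) in pairs}


def restrict(pairs, members):
    """Keep only the arcs whose two ends both belong to the given subsystem."""
    return {(a, b) for (a, b) in pairs if a in members and b in members}


ENTAILS_CLOSED = transitive_closure(ENTAILS)
ENTAILED_BY = inverse(ENTAILS_CLOSED)
SYSTEM_X_ENTAILS = restrict(ENTAILS_CLOSED, TENSE_SYSTEM_X)

if __name__ == "__main__":
    print(sorted(SYSTEM_X_ENTAILS))
```

Under these assumptions, restricting the closed relation to the four values of TenseSystem-x leaves each of the three specific tenses entailing AnyTense, which is one way to read the claim that the subclass is also a substructure of TenseFeature.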
Now suppose we have a language X whose Tense system is isomorphic to TenseSystem-x; that is, it has four instances (values) that map one to one to the values in TenseSystem-x in GOLD as shown in Figure 2, and that have the same properties (entails and entailedBy) as shown in Figure 3. Call that system XTense, which can now be represented as a class concept in a profile or COPE for that language.
Figure 2 Mapping to GOLD TenseSystem-x from XTense
Figure 3 XTense system isomorphic to TenseSystem-x
By mapping XPres(ent) to GOLD NonP(ast), XRec(ent)P(ast) to GOLD Hod(iernal)P(ast), and XRem(ote)P(ast) to GOLD PreHod(iernal)P(ast), the encoder is claiming that XPresent relates most closely to GOLD NonPast, not to its GOLD namesake Present, etc. (i.e. it's called "Present" in this description of X, even though it means "NonPast", etc.). In this way, we have not only a term mapping, but also a structure mapping.11 One clear advantage of the approach we're proposing is that it provides a straightforward way of doing structural mapping using profiles and COPEs. We set up GOLD structures (classes with structural relations like "entails", but it could be others like "constituentOf", defined over their instances), and provide ways of mapping structures defined for specific languages to those GOLD structures.
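A structure mapping of this kind can be checked mechanically. The Python sketch below is a hedged illustration rather than a description of the profile or COPE machinery: the three named mappings come from the text, whereas the fourth language-X value (`XAnyTense`), the specific entails arcs on both sides, and the function names are assumptions introduced for the example.

```python
# Assumed "entails" arcs within GOLD TenseSystem-x (illustrative only).
GOLD_ENTAILS = {
    ("NonPastTense", "AnyTense"),
    ("HodiernalPastTense", "AnyTense"),
    ("PreHodiernalPastTense", "AnyTense"),
}

# Assumed "entails" arcs within the language-specific system XTense.
# XAnyTense is a hypothetical name for the fourth value of XTense.
X_ENTAILS = {
    ("XPresent", "XAnyTense"),
    ("XRecentPast", "XAnyTense"),
    ("XRemotePast", "XAnyTense"),
}

# The term mapping: the first three pairs are the ones given in the text.
X_TO_GOLD = {
    "XPresent": "NonPastTense",
    "XRecentPast": "HodiernalPastTense",
    "XRemotePast": "PreHodiernalPastTense",
    "XAnyTense": "AnyTense",
}


def image(arcs, mapping):
    """Translate every arc of the source system through the term mapping."""
    return {(mapping[a], mapping[b]) for (a, b) in arcs}


def is_structure_preserving(mapping, source_arcs, target_arcs):
    """True if every source arc lands on an arc of the target system."""
    return image(source_arcs, mapping) <= target_arcs


def is_isomorphism(mapping, source_arcs, target_arcs):
    """True if the mapping is one to one and the arcs correspond exactly."""
    one_to_one = len(set(mapping.values())) == len(mapping)
    return one_to_one and image(source_arcs, mapping) == target_arcs


if __name__ == "__main__":
    print(is_isomorphism(X_TO_GOLD, X_ENTAILS, GOLD_ENTAILS))
```

If two language values were mapped to the same GOLD value, for example by sending XRemotePast to HodiernalPastTense as well, the mapping would still preserve structure but would no longer be an isomorphism; this is the more general situation in which the mapping has no functional inverse.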
Creating Feature and Feature Class Files
The task of compiling feature values and classes to populate the GOLD ontology involved a number of steps. Group members performed a literature review on various morpho-syntactic topics, including modality, voice, aspect, tense, person, number, polarity, case, evidentiality, gender, mood, evaluatives and force.
Beforehand, we determined what kinds of information would be collected for each term, and for consistency created an Excel template for this information. The following pieces of data were recorded:
Creation of Hierarchical Structure in GOLD
One of the main challenges of importing these terms was determining what kind of structure exists between the core concepts in GOLD, as well as what internal structure exists in categories such as Case or Number. For each category, we went through the terms that were collected and decided which were the most relevant to a certain class or category. We then attempted to break up these terms into salient groupings with the help of literature dealing with this topic. In order to do this, it was necessary to discuss what kinds of domains were dealt with by each class, and how the individual terms related to each other in this domain. A relatively simple case of this is tense, which deals with placement along a timeline, and the terms that were grouped together dealt with a similar placement relative to the present moment. These groupings of terms were then placed relative to each other in the overall hierarchy, in an attempt to allow for the attested linguistic usages seen in the literature and examples that were collected, as well as leaving room for possible usages that could occur within the domain.
Since GOLD was created as part of the E-MELD project, control of its development was limited to the project's active participants. We now propose to pass control to a body of overseers responsible to the GOLD Community, consisting of everyone who has an intellectual or practical stake in its success. Many of us who have been involved in the work so far are eager to continue, but we wish to make it clear that we are now simply partners with you in its future development.
1. We thank everyone else who's worked on E-MELD at U Arizona between 2001 and 2005, especially Scott Farrar, Will Lewis, Ruby Basham, Peter Norquest, Shauna Eggers, Alexis Lanham, Jesse Kirchner, and Sandy Chow. We also thank everyone who's worked on E-MELD elsewhere, especially Anthony Aristar, Helen Aristar-Dry, Laura Buszard-Welcher, Zhenwei Chen, Jeff Good, Baden Hughes, Gary Simons, and Doug Whalen.
2. More needs to be said about this. There are smarter and dumber ways to export to XML, and the results of dumb exportation can be pretty painful.
3. TEI has attempted to overcome this problem by modularizing its SIS schema (the "pizza" model), so that encoders can select from a menu of "toppings" to a base schema.
4. Lewis, W.D., S. Farrar & D.T. Langendoen (2001) Building a Knowledge Base of Morphosyntactic Terminology. Proceedings of the IRCS Workshop on Linguistic Databases, December 2001, pp. 150-156. Philadelphia: Institute for Research in Cognitive Science, University of Pennsylvania. <http://emeld.org/documents/IRCS-BuildingKnowledgeBase.pdf>
5. Farrar, S., D.T. Langendoen & W.D. Lewis (2002) Bridging the Markup Gap: Smart Search Engines for Language Researchers. Proceedings of the LREC Workshop on Resources and Tools for Field Linguistics, Las Palmas, May 2002. <http://emeld.org/documents/LREC-BridgingMarkupGap.pdf>
6. Farrar, S., W.D. Lewis & D.T. Langendoen (2002) An Ontology for Linguistic Annotation. Semantic Web Meets Language Resources: Papers from the AAAI Workshop, Edmonton, July 2002. Technical Report WS-02-16, pp. 11-19. Menlo Park, CA: AAAI Press. <http://emeld.org/documents/AAAI-OntologyLinguisticAnnotation.pdf>
7. Farrar, S. & D.T. Langendoen (2003) A Linguistic Ontology for the Semantic Web. GLOT International 7(3).97-100. <http://emeld.org/documents/GLOT-LinguisticOntology.pdf>
8. Simons, G.F., B. Fitzsimons, D.T. Langendoen, W.D. Lewis, S. Farrar, A. Lanham, R. Basham & X. Gonzalez (2004) A Model for Interoperability: XML Documents as an RDF Database. Proceedings of the 2004 E-MELD Workshop, Detroit, July 2004. <http://emeld.org/workshop/2004/langendoen-paper.html>
9. Simons, G.F., W.D. Lewis, S. Farrar, D.T. Langendoen, B. Fitzsimons & X. Gonzalez (2004) The semantics of markup: Mapping legacy markup schemas to a common semantics. Proceedings of the XMLNLP Workshop, Barcelona, July 2004. <http://emeld.org/documents/SOMFinal1col.pdf>
10. One solution would be to make feature values subclasses of their corresponding features, but this would effectively obliterate the conceptual distinction between features and values. For example, the relation between Tense and PastTense would be the same as the relation between PastTense and HodiernalPastTense, namely the second is a subclass of the first.
11. Although we described the mapping between TenseSystem-x and XTense as an isomorphism, in general the mapping from language feature values to GOLD feature values is a function without a functional inverse. For example, while each language value should map uniquely to a GOLD value, the converse is not true. A single GOLD value may map to more than one language value. Note also that two distinct language values may map to the same GOLD value.