
E-MELD 2005 Ontology FAQ

The E-MELD team at the University of Arizona has been building GOLD (General Ontology for Linguistic Description), which is the focus of the E-MELD 2005 workshop. This FAQ is designed to answer some of the questions that the E-MELD team has previously been asked about GOLD.

What is an Ontology?

The term "ontology" is one of those words that cause a lot of misunderstanding. In philosophy, of course, it refers to the study of what kinds of things exist, or the study of first principles and the essence of things. Since most linguists have some philosophical background or knowledge, this is the definition which they are most familiar with.

If this were what we meant by "building an ontology of linguistic concepts," then any attempt to formulate one would be a bad idea. Not only would it take a very long time, if it were possible to do it, but it's also probably beyond our abilities with our present state of linguistic knowledge. We would like to say immediately, then, that this is not what we are trying to do here.

Fortunately, the term ontology has a completely different meaning in information technology. An ontology here is essentially a machine-readable formal statement of a set of terms and a working model of the relationships holding among the concepts referred to by those terms in some particular domain of knowledge. Its purpose is not to define meaning, but to allow computers to navigate human knowledge in a way that mimics intelligence. What it is, then, is not anywhere near as important as what it allows computers to do.

What does an ontology allow computers to do?

By encoding the relationships among concepts in a machine-readable format, an ontology can allow for automated human-like reasoning, which will, in turn, enable intelligent searching of linguistic materials on the Internet. Given an ontology of linguistic concepts and language data systematically related to that ontology, computers can respond to many highly specific linguistic queries.

To take a much over-simplified example, suppose a linguistic ontology had the following hierarchy:

Using this hierarchy and appropriately marked-up data, a machine could compare the number marking of different languages and answer queries such as "What kind of morphosyntactic number marking is found in this language?", "What languages have paucal number marking?", and "What languages have dual verb agreement?" For this purpose, the computer does not need to know what "number" means, only that Singular, Dual, Paucal, and Plural are considered numbers. Of course, the developers of GOLD are trying to incorporate accurate definitions of concepts. However, for many purposes, the ontology does not need to include the "perfect" definition of number; it only needs to position "number" in a hierarchy of concepts which are related to each other in pre-defined ways.
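The kind of query described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the flat hierarchy (Number with four subtypes), the language names, and the function names are all hypothetical and do not come from GOLD itself.

```python
# Hypothetical toy hierarchy: each concept points to its parent concept.
PARENT = {
    "Number": None,
    "Singular": "Number",
    "Dual": "Number",
    "Paucal": "Number",
    "Plural": "Number",
}

# Hypothetical marked-up data: language -> concepts attested in its data.
LANGUAGE_FEATURES = {
    "Language A": {"Singular", "Plural"},
    "Language B": {"Singular", "Dual", "Paucal", "Plural"},
    "Language C": {"Singular", "Dual", "Plural"},
}


def is_a(concept, ancestor):
    """Return True if concept is the ancestor itself or lies below it."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = PARENT.get(concept)
    return False


def number_marking(language):
    """What kinds of number marking are found in this language?"""
    return sorted(
        c for c in LANGUAGE_FEATURES[language]
        if c != "Number" and is_a(c, "Number")
    )


def languages_with(concept):
    """What languages have this kind of marking?"""
    return sorted(
        lang for lang, feats in LANGUAGE_FEATURES.items() if concept in feats
    )


print(number_marking("Language C"))
print(languages_with("Paucal"))
```

Note that nothing in the sketch encodes what "number" means; the queries work purely from the position of each concept in the hierarchy.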

Does this mean I would have to use only the terms in the ontology to describe my language?

Not at all. Obviously, for an ontology to function in the manner outlined above, search engines have to find exactly the terms that the ontology tells them they should find. But this doesn't mean that linguists are being asked to abandon the terminology they are used to and use only terms sanctioned by GOLD. In fact, GOLD was developed in order to avoid trying to enforce the use of a standard terminology set. Although GOLD is now being developed to integrate with the semantic web (and thus to enable automated reasoning), initially GOLD was intended to function primarily as an inter-language among the many different terminology sets currently in use by linguists.

Different linguistic traditions, for example, use not only different terms for the same concept (e.g. "obviative" and "fourth person"), but the same term for different concepts (e.g., "absolutive" doesn't mean the same thing for Australian languages as it does for Uto-Aztecan). This multiplicity of terms constitutes a major obstacle for machine-handling of linguistic data. However, computers will be able to handle different sets of linguistic data intelligently if whatever term the linguist is using is related to a concept in the ontology. So, for example, suppose one linguist uses the term "Obviative" and the other "Fourth Person" to refer to the same concept. If they each relate their term to the same concept in an ontology, then a computer will be able to treat those two terms as equivalent for searching. Similarly, if two linguists use the term "Absolutive" in two different ways and relate their use of the term to different concepts in an ontology, a computer will know to treat the terms as different--even though they have the same "label". In neither case is any change required in the actual mark-up of the data.
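As a rough sketch of this idea, the mapping from each linguist's own label to an ontology concept can be pictured as a lookup table. The concept identifiers, source names, and function names below are hypothetical placeholders, not actual GOLD identifiers.

```python
# Hypothetical links from (data source, local term) to an ontology concept.
# The data itself keeps its original labels; only this table is added.
TERM_TO_CONCEPT = {
    ("Linguist 1", "Obviative"): "concept:Obviative",
    ("Linguist 2", "Fourth Person"): "concept:Obviative",
    ("Linguist 3", "Absolutive"): "concept:AbsolutiveSenseA",
    ("Linguist 4", "Absolutive"): "concept:AbsolutiveSenseB",
}


def same_concept(source_a, term_a, source_b, term_b):
    """Two terms are equivalent for searching if they link to one concept."""
    return (
        TERM_TO_CONCEPT[(source_a, term_a)]
        == TERM_TO_CONCEPT[(source_b, term_b)]
    )


def sources_for(concept):
    """Find every (source, term) pair linked to the given concept."""
    return sorted(k for k, v in TERM_TO_CONCEPT.items() if v == concept)


# Different labels, same concept:
print(same_concept("Linguist 1", "Obviative", "Linguist 2", "Fourth Person"))
# Same label, different concepts:
print(same_concept("Linguist 3", "Absolutive", "Linguist 4", "Absolutive"))
```

The comparison is made on the linked concept, never on the label, which is why the labels in the marked-up data can stay as they are.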

How can different terms be "related" to concepts in the ontology?

This requires the existence of tools that are ontology-aware. Ideally, linguists should be able to relate their terms to ontological concepts as part of the process of analyzing or preparing data. Tools should also provide support for cases where a linguist determines that the necessary concept is not found in the ontology they are using. For example, the tool could allow the linguist to relate their concept to its "best match" in the ontology and, at the same time, send a note to an ontology developer that, perhaps, a new concept needs to be added.

Presently, a few prototype tools exist (for example, FIELD, Onto-Elan, and OntoGloss); and more will certainly be built if the linguistics community, or a significant part of it, espouses the use of an ontology for Internet-based data retrieval and analysis. One purpose of the E-MELD workshop is to acquaint linguists with the different linguistic ontologies which are currently under construction, and with their potential uses. Another purpose is to refine and extend the GOLD ontology to make it maximally useful to linguists working in different language areas and on different language structures.

What does GOLD look like?

An HTML representation of GOLD can be viewed at the GOLD community website. A print version is also available in MS Word format.

The language I'm working on has categories that are not included in GOLD. How can GOLD handle these?

It would be impossible to include in GOLD all the concepts that linguists might need to describe the world's languages. So the developers of GOLD have decided on two ways to deal with new and/or language-specific categories.

Scenario 1: A linguist is using software which allows them to relate their terminology to GOLD. However, they find that their language has a kind of past tense unlike anything heretofore discovered, and so none of the past tense concepts available in the GOLD hierarchy seems to be the right one to attach their term ("X-past") to. In this case, the linguist should relate the new term to a more general concept of which the new concept is a subtype. That is, the linguist's best procedure may be to relate "X-past" simply to "past tense," adding a comment about the ways that it differs from "past tense" as defined in the ontology. This way, linguists searching for different kinds of past tense will discover this data, as undoubtedly they should, even though the data does not exemplify any of the expected types of past tense.
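Scenario 1 can be sketched as follows. The tense hierarchy, the second term ("Y-past"), and all class and function names are hypothetical and chosen only for illustration; the point is that a term linked to the general concept, with a comment, is still found by a search over past tenses.

```python
from dataclasses import dataclass

# Hypothetical fragment of a tense hierarchy: concept -> parent concept.
PARENT = {
    "Tense": None,
    "PastTense": "Tense",
    "RecentPast": "PastTense",
    "RemotePast": "PastTense",
}


@dataclass
class TermLink:
    term: str          # the linguist's own label
    concept: str       # the ontology concept it is related to
    comment: str = ""  # free-text note on how the term differs


links = [
    # No specific subtype fits, so the term is linked to the general concept.
    TermLink("X-past", "PastTense",
             "Unlike the past tense subtypes currently in the ontology."),
    TermLink("Y-past", "RecentPast"),
]


def falls_under(concept, ancestor):
    """Return True if concept is the ancestor itself or lies below it."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = PARENT.get(concept)
    return False


def search(ancestor):
    """Return the terms whose linked concept falls under the ancestor."""
    return [link.term for link in links if falls_under(link.concept, ancestor)]


print(search("PastTense"))
print(search("RecentPast"))
```

A search for past tenses in general returns both terms, while a search for one specific subtype returns only the term linked to it; the comment travels with the link so that a reader or an ontology developer can see why the general concept was chosen.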

Scenario 2: A group of linguists, say Bantuists, normally uses terminology and/or concepts which are specific to their language area. This community should probably develop a Community of Practice Extension (COPE) to GOLD. A COPE is a kind of "mini-ontology" which is arranged so that its structure, concepts, and terminology are those that linguists in a particular area are accustomed to. Each of the concepts in a COPE is linked to a concept in the main GOLD ontology, and is thus interpretable by it. But GOLD can remain in the background, and applications for Bantuists can be based on the COPE. Indeed, the Bantu COPE may constitute all the ontological apparatus that this community need ever be aware of.

This was considered desirable because GOLD itself is being structured to integrate with an upper ontology, the better to perform the functions available via the semantic web. As our understanding of upper ontologies changes, the structure of GOLD may change. But the tools and software already built for a particular research community should not fall victim to these changes. A COPE is a way of insulating some ontology-aware tools from unnecessary obsolescence; it is also a way of allowing a community to define once and re-use area-specific concepts with which it is familiar.
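The insulating role of a COPE can be pictured as a layer of indirection. The sketch below is a simplified assumption about how such a layer might work, not a description of how COPEs are actually implemented; the class, the concept names, and the `gold:` identifiers are all hypothetical.

```python
class Cope:
    """A hypothetical mini-ontology: community concept -> linked GOLD concept."""

    def __init__(self, links):
        self._links = dict(links)

    def to_gold(self, local_concept):
        """Resolve a community concept to the GOLD concept it is linked to."""
        return self._links[local_concept]

    def relink(self, local_concept, new_gold_concept):
        """Update one link, e.g. after the main ontology is restructured."""
        self._links[local_concept] = new_gold_concept


def annotate(cope, local_concept, example):
    """A tool function that only ever names concepts from the COPE."""
    return {
        "concept": local_concept,
        "gold": cope.to_gold(local_concept),
        "example": example,
    }


cope = Cope({"CommunityConceptA": "gold:OldLocation"})
before = annotate(cope, "CommunityConceptA", "example 1")

# The main ontology changes; only the link inside the COPE is updated.
cope.relink("CommunityConceptA", "gold:NewLocation")
after = annotate(cope, "CommunityConceptA", "example 1")

print(before["gold"], "->", after["gold"])
```

The tool code and the community's own concept names are unchanged across the restructuring; only the single link from the COPE to the main ontology had to be edited.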

How can I learn more about COPEs?

More on COPEs and on the way that specific feature structures are handled in GOLD can be found at More on COPEs.

How can I learn more about GOLD?

You may wish to look at some of the presentations on GOLD which have been made at previous E-MELD workshops. In particular, you may wish to read A Model for Interoperability: XML Documents as an RDF Database from the 2004 Workshop and Markup and the GOLD Ontology from the 2003 Workshop.

In addition, the following papers will provide background information on GOLD:

  • Farrar, Scott and D. Terence Langendoen (2003). A Linguistic Ontology for the Semantic Web. GLOT International 7(3), pp. 97-100. [PDF]
  • Farrar, Scott, William D. Lewis and D. Terence Langendoen (2002). An Ontology for Linguistic Annotation. Semantic Web Meets Language Resources: Papers from the AAAI Workshop, Technical Report WS-02-16, pp. 11-19. Menlo Park, CA: AAAI Press. [PDF]
  • Farrar, Scott, D. Terence Langendoen and William D. Lewis (2002). Bridging the Markup Gap: Smart Search Engines for Language Researchers. Proceedings of the Workshop on Resources and Tools for Field Linguistics, Las Palmas, Canary Islands, Spain. [PDF]
  • Farrar, Scott, William D. Lewis and D. Terence Langendoen (2002). A Common Ontology for Linguistic Concepts. Proceedings of the Knowledge Technology Conference, Seattle, WA, March 2002. [PDF]
  • Lewis, William D., Scott Farrar and D. Terence Langendoen (2001). Building a Knowledge Base of Morphosyntactic Terminology. Proceedings of the IRCS Workshop on Linguistic Databases, pp. 150-156. Philadelphia: Institute for Research in Cognitive Science, University of Pennsylvania. [PDF]