OntoGloss: An Ontology-Based Annotation Tool

    
Farhad Mostowfi, Farshad Fotouhi & Anthony Aristar
(Department of Computer Science, Wayne State University, Detroit, Michigan)




     1. Introduction

OntoGloss is an ontology-based annotation tool that uses predefined concepts in an ontology to mark up a document. The difference between regular annotation and ontology-based annotation is that in the former, the annotation is plain text collected according to a fixed structure [21], while in the latter, the annotation is a set of instances of classes and relations based on the domain ontology. In ontology-based annotation, annotating means assigning the annotated text to a concept in the ontology (instantiating a class) or to a data type, or relating it to another annotated text (instantiating a relation). Such annotation is in line with the requirements suggested in [10] and provides:

  • Expressive adequacy. Ontology-based annotation can reach any level of granularity, from the most general to the finest.
  • Semantic adequacy. Ontologies are made of structures and operators with formal semantics that can be shared and understood within the same community and by other applications.
  • Incrementality. One can access information at any stage of interpretation and create output with any degree of generalization. Merging and integrating ontology-based annotations is also possible.
  • Uniformity. The same structures and operators are used as building blocks throughout the annotation process.
  • Openness. Openness is guaranteed since no specific theory of representation is enforced.
  • Extensibility. Many tools have already been built for the Semantic Web [2,13] and many more have been promised. These tools make ontology-based annotation extensible.
  • Human readability. The annotated information is easily read by humans as well as understood by machines.
  • Processability and explicitness. The semantics are formal enough to leave little room for different interpretations by different applications.
  • Consistency. The ontology designer makes sure that the ontology is consistent in representation and reasoning. Information that is committed to the ontology (instances of the ontology) is therefore consistent with the ontology as well.

According to Bird's definition [3], “linguistic annotation covers any descriptive or analytic notations applied to raw language data”. For linguists, marking up a document is a way of preserving its content. This is more urgent for languages that are in danger of disappearing [14,5]. Endangered languages can benefit tremendously from ontology-based annotations that explicitly express the semantics of the content. An ontology, as a way of formalizing knowledge, can help linguists resolve incompatibilities in markup data in a multilingual annotation and search environment. An ontology captures the knowledge of the field in a generic form so that it can be understood, shared and reused by the community. Later, this knowledge can be used to automatically annotate morphemes, words or phrases in other documents.

Ontologies have been developed to share knowledge “between people and heterogeneous and distributed systems” [12]. They are used in knowledge management, e-commerce, natural language applications, intelligent information integration, information extraction and information retrieval [15]. By formalizing the terminology and the relations between concepts in a field, ontologies make it possible to integrate different sources of information. An ontology usually covers a whole domain or sub-domain rather than a single application. Once experts develop an ontology for a domain, it becomes a resource for everybody else to use. Ontologies are emerging in different areas, and with the advent of ontology languages like OWL [16] it is becoming easier to develop one from scratch or to use those that are available as a starting point for new ones.

OntoGloss uses the linguistic knowledge gathered through annotation by the community to automatically annotate other documents. For any annotated document, a set of RDF [20] triples is created and saved in the database. On the next visit to the same document, OntoGloss retrieves all the triples for the document from the database and marks all the annotated sections. As long as the structure of the document does not change dramatically (which is usually the case in linguistics), this recreates the same annotated sections. OntoGloss uses Uniform Resource Identifiers (URIs) [20] to identify resources and to represent relations between them. It keeps annotations separate from the actual documents and supports two modes of operation: local and remote. In the local mode, annotated data is saved locally and is used in annotating documents that are visited for the first time. In the remote (shared annotation server) mode, a linguist can add his/her annotated data to a server for the community to use.

OntoGloss has the following features:

  • Using different ontologies to mark up documents, paragraphs, sentences, words and morphemes. It is independent of the selected ontology and can accommodate several ontologies at the same time.
  • Annotating the document with a drag-and-drop operation. By moving the mouse over an annotated selection, the linguist can see the type of annotation.
  • Automatically annotating new documents based on previously annotated documents.
  • The ability to use a lexical reference system. This lexical reference system might already exist, e.g. WordNet [26], or it can be built and extended gradually within OntoGloss. Like WordNet for the English language, this system can be used as a resource during the annotation process, providing synonymy, hyponymy and different senses for individual words.
  • Exporting annotation data into RDF format. RDF data can be loaded into an RDF repository with querying capabilities, such as Sesame [4].
  • Keeping annotation separate from the actual document. Annotated data is saved in a database and is loaded during each visit to the document.
  • Annotating the whole document with general information like the name of the annotator, date and other information as specified in the Dublin Core.
  • Supporting local and remote annotation servers.

When a document is visited for the first time, OntoGloss compares each word with all the annotated text in the database and assigns the same type of annotation to matching words. This serves as an initial suggestion and can be changed by the linguist if needed. Classes in the ontology are color-coded. An annotated text has the same color as the class that is used in the annotation. This gives the linguist a visual cue to the type of markup.

There are many text annotators available, both as open source and as commercial products. What is different about a linguistic annotator is that words in linguistics are broken up into morphemes. OntoGloss is able to annotate the morphemes in a word. For example, if xxxabc is composed of xxx with a suffix -abc, a linguist using OntoGloss is able to annotate each morpheme separately. In the automatic annotation of new documents, when OntoGloss finds yyyabc, it can determine whether it has the same suffix [23] and annotate it with the same class in the ontology.
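The word and suffix matching described above can be sketched roughly as follows. This is a minimal illustration and not the OntoGloss implementation: the in-memory dictionary standing in for the annotation database, the leading "-" convention for suffix entries, the class name Suffix and the function name suggest_annotations are all assumptions made for the example.

```python
# Hypothetical stand-in for the annotation database: annotated text -> ontology class.
# Keys starting with "-" denote previously annotated suffix morphemes (an assumption).
ANNOTATIONS = {
    "have": "Verb",
    "-abc": "Suffix",
}


def suggest_annotations(word, annotations):
    """Return a list of (text, ontology_class) suggestions for one word."""
    suggestions = []
    key = word.lower()

    # Whole-word match against previously annotated text.
    if key in annotations:
        suggestions.append((word, annotations[key]))

    # Suffix match: try the longest known suffix first.
    suffixes = sorted(
        (k for k in annotations if k.startswith("-")), key=len, reverse=True
    )
    for suffix in suffixes:
        morpheme = suffix[1:]
        if morpheme and len(key) > len(morpheme) and key.endswith(morpheme):
            suggestions.append((word[-len(morpheme):], annotations[suffix]))
            break

    return suggestions
```

The suggestions are only a starting point; as the text notes, the linguist can change them.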

In section 2, we begin by introducing the components of OntoGloss. After an introduction to ontology languages, we go into more detail on a few modules, namely the Ontology Management Interface, the Lexical Reference Interface, Annotation Positioning and the User Interface. In section 3 we look at related work, and in section 4 we look at some ideas for future work.

    
2. OntoGloss Architecture

Figure 1 shows the OntoGloss architecture. In this figure:

  • Ontology Management and Browsing Interface. Provides a generic interface to different ontology representations. Currently, ontologies written in OWL and RDF are supported.
  • Lexical Reference Interface. This database interface links OntoGloss to a lexical reference system like WordNet. Its job is to facilitate the annotation process with the help of a lexical knowledge base. The lexical reference is different for each language and can be built up as the annotation progresses.
  • RDF Repository Interface. The annotated data is loaded into an external RDF repository for querying and other functionality such as reasoning. Currently the interface exports data into Sesame [4].
  • Auto Annotation Module. Data that is annotated on the local machine or resides on a server can be used in annotating other documents. This module gets the information from the Annotation Database and applies it to other documents.
  • Annotation Positioning Module. This module is responsible for saving the location of an annotation and retrieving it on the next visit to the document.
  • Information Extraction Module. This module does all the lower-level information extraction, including breaking down words into their morphemes, removing white space and counting the number of occurrences of a word. The Auto Annotation Module uses the output of the Information Extraction Module while automatically annotating documents.
  • Annotation Database. This is the internal repository of annotated data plus information on the location of each annotation.
  • User Interface. The prototype is built on a Microsoft Access database with an embedded Microsoft Internet Explorer. Plans are underway to implement OntoGloss as an open-source application.

    Figure 1. OntoGloss Architecture

     2.1 Ontology Management and Browsing Interface

An ontology consists of a set of concepts in a domain with their attributes and relations. There are also constraints, axioms and other constructs that represent the general knowledge in the domain. Concepts or classes (either physical or abstract) are the basic building blocks. Everything else in the ontology is meant to represent knowledge about these concepts. This knowledge might be just a concept's attributes, or it might be more elaborate, such as the cardinality of the properties of concepts, explaining how classes are related to each other and to other entities in the world. Relational properties are binary relations between two concepts. They might be symmetric, transitive or both. A relation is symmetric if, whenever A is related to B, B is also related to A. A relation is transitive if the relation between A and B and the relation between B and C imply that the same relation holds between A and C. An inverse relational property is the inverse of a relation, like isParent and its inverse isChild. A concept hierarchy is a taxonomy that organizes concepts in generalization and specialization relationships [9]. In what follows, we give a quick introduction to RDF [20], RDF Schema [19] and OWL [16] and then introduce the schema that we have picked to represent the ontology.
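As a small illustration of these definitions, a binary relation can be modeled as a set of (A, B) pairs, and the symmetric, transitive and inverse cases computed directly. This is only a sketch for intuition; the function names are assumptions and no ontology library is involved.

```python
def symmetric_closure(pairs):
    """Add (B, A) for every (A, B), so the relation holds in both directions."""
    return set(pairs) | {(b, a) for a, b in pairs}


def transitive_closure(pairs):
    """Repeatedly add (A, C) whenever (A, B) and (B, C) are present."""
    closure = set(pairs)
    while True:
        new = {
            (a, d) for a, b in closure for c, d in closure if b == c
        } - closure
        if not new:
            return closure
        closure |= new


def inverse(pairs):
    """Swap each pair, e.g. turning isParent pairs into isChild pairs."""
    return {(b, a) for a, b in pairs}
```

For example, if isParent holds for ("Ann", "Bob"), then inverse yields ("Bob", "Ann") for isChild.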


     2.1.1 Ontology Languages and Their Constructs

RDF [20] is a data model for metadata, commonly serialized in XML. It uses Uniform Resource Identifiers (URIs) to identify resources. It represents relations between resources in the domain in a form that is understandable by machines. To show these relations it uses triples of the form (Subject, Predicate, Object), which can be represented as a directed graph with Subject and Object as nodes and Predicate as the edge. It also adds to the semantic content by using containers and reification (statements about statements). RDF Schema [19] is a language for expressing concepts, relations between concepts, and their attributes and constraints. It is a semantic extension of RDF with the added features of reasoning and advanced search. Unlike RDF, in RDF Schema classes and properties can be used to describe other classes and properties. RDF Schema is very expressive, but still has many shortcomings. Among them is the lack of cardinality constraints, which put limits on the maximum and minimum number of values that a property might have. It is also unable to express transitivity, uniqueness, equivalence, union, intersection and disjointness. These issues have been addressed in the OWL language.
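The view of triples as a directed graph can be made concrete with a few lines of code. The sketch below uses plain tuples and prefixed names; the identifiers doc:w1, doc:s1, ex:partOf and gold:Sentence are made up for illustration, and a real application would normally use full URIs and an RDF library.

```python
from collections import defaultdict

# Illustrative triples (Subject, Predicate, Object); all identifiers are hypothetical.
TRIPLES = [
    ("doc:w1", "rdf:type", "gold:Verb"),
    ("doc:w1", "ex:partOf", "doc:s1"),
    ("doc:s1", "rdf:type", "gold:Sentence"),
]


def build_graph(triples):
    """Map each subject node to its outgoing (predicate, object) edges."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return dict(graph)


def objects(graph, subject, predicate):
    """Follow all edges labeled `predicate` that leave `subject`."""
    return [o for p, o in graph.get(subject, []) if p == predicate]
```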

OWL [16] is capable of conveying more semantics and meaning than XML, RDF or RDF Schema. It is the latest language (after DAML+OIL) added to the family of ontology languages by the W3C. Because OWL supports reasoning, inference might be undecidable even for a simple set of rules. That is why OWL comes as a layered language with three layers. OWL Full is a semantic and syntactic extension of RDF and RDF Schema and is likely to be undecidable. OWL DL is a decidable version of OWL Full with a friendlier syntax based on description logic. The third one is OWL Lite, which is a subset of OWL DL and is more tractable than the other two.

OWL covers the following constructs from RDF Schema: rdfs:Class, rdf:Property, rdfs:subClassOf, rdfs:subPropertyOf, rdfs:domain, rdfs:range and rdf:type. In OWL, two classes or two properties can be declared as synonyms (equivalentClass or equivalentProperty). The same can be done for instances. If two classes are equivalent, any instance that belongs to one also belongs to the other. The same is true of two properties that are related through equivalentProperty: they both relate an instance to the same set of instances. There are also the differentFrom and AllDifferent constructs. The former states that two instances are different and the latter states that all the listed instances are different. The constructs inverseOf, TransitiveProperty, SymmetricProperty, FunctionalProperty and InverseFunctionalProperty describe different types of properties. If two properties are inverses of each other, this is expressed with inverseOf. A FunctionalProperty is a property with a unique value, which means its cardinality is either zero or one. If the inverse of the property is functional, then InverseFunctionalProperty is used, which is like a unique key in the relational model. The constructs minCardinality, maxCardinality and cardinality specify the minimum, maximum and exact number of values that instances of a class may have for a property. The intersectionOf construct states the intersection of classes. OWL DL and OWL Full have other constructs in addition to those explained above. These are: class axioms like oneOf and disjointWith; Boolean combinations like unionOf, intersectionOf and complementOf; arbitrary cardinality; and filler information like hasValue.
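Two of these constructs can be illustrated with a toy model. The sketch below propagates instances between classes declared equivalent, and lists subjects that have more than one distinct value for a property treated as functional. It is a simplification for intuition only: the function names and data layout are assumptions, and a real OWL reasoner would not report the second case as an error but would infer that the two values denote the same individual unless they are declared different.

```python
def instances_with_equivalents(class_instances, equivalent_pairs):
    """Share instances between classes related by equivalentClass."""
    result = {cls: set(members) for cls, members in class_instances.items()}
    changed = True
    while changed:
        changed = False
        for a, b in equivalent_pairs:
            merged = result.get(a, set()) | result.get(b, set())
            if result.get(a) != merged or result.get(b) != merged:
                result[a] = set(merged)
                result[b] = set(merged)
                changed = True
    return result


def functional_conflicts(assertions):
    """Return subjects with several distinct values for one functional property.

    `assertions` is an iterable of (subject, value) pairs for that property.
    """
    values = {}
    for subject, value in assertions:
        values.setdefault(subject, set()).add(value)
    return sorted(s for s, v in values.items() if len(v) > 1)
```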


     2.1.2 Ontology Storage

In OntoGloss, for each construct in the ontology, there is a table that saves all the relevant information about that construct, plus information such as the version, the current status and the original ontology that defined the construct. Figure 2 shows the part of the schema that relates the Class and SubClassOf tables. Table 1 and Table 2 show these two tables populated with the part of the GOLD ontology [8] shown in Figure 3.


Figure 2. Relations between Class and SubClassOf tables


Table 1. Class Table


Table 2. SubClassOf Table


Figure 3. Part of the GOLD Ontology
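Since the figures and tables are not reproduced here, the following is only a guess at what such a storage layout might look like, using SQLite in place of the prototype's database. The table names Class and SubClassOf come from the text; every column name, and the helper functions create_store and superclasses, are assumptions for this sketch.

```python
import sqlite3


def create_store():
    """Create an in-memory store with hypothetical Class and SubClassOf tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Class (
            class_id INTEGER PRIMARY KEY,
            name     TEXT NOT NULL,
            ontology TEXT NOT NULL,   -- original ontology that defined the class
            version  TEXT NOT NULL,
            status   TEXT NOT NULL
        );
        CREATE TABLE SubClassOf (
            child_id  INTEGER NOT NULL REFERENCES Class(class_id),
            parent_id INTEGER NOT NULL REFERENCES Class(class_id),
            PRIMARY KEY (child_id, parent_id)
        );
        """
    )
    return conn


def superclasses(conn, name):
    """Return the names of all direct and indirect superclasses of a class."""
    rows = conn.execute(
        """
        WITH RECURSIVE ancestors(id) AS (
            SELECT s.parent_id
            FROM SubClassOf s JOIN Class c ON c.class_id = s.child_id
            WHERE c.name = ?
            UNION
            SELECT s.parent_id
            FROM SubClassOf s JOIN ancestors a ON s.child_id = a.id
        )
        SELECT c.name FROM Class c JOIN ancestors a ON c.class_id = a.id
        ORDER BY c.name
        """,
        (name,),
    ).fetchall()
    return [row[0] for row in rows]
```

Keeping one row per subclass link means the class hierarchy can be walked with a recursive query, as superclasses does, without loading the whole ontology into memory.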

     2.2 Lexical Reference Interface

Annotation tools can benefit from lexical references that provide the user with the semantics of a word. This includes synonyms, meronyms (part of), hypernyms and hyponyms (is a kind of). The better the user knows the word, the better he/she is able to annotate it with ontology concepts. WordNet is the best-known lexical reference system for the English language. Through the Lexical Reference Interface, OntoGloss is able to link to WordNet or other lexical reference systems. In the future, for languages that do not have a lexical reference (especially endangered languages), the linguist would be able to add to or update the lexicon during the annotation process.

Wordnet2sql [1] has converted WordNet into a set of tables that can be used in any RDBMS. Our position is that the same schema or a subset of it can be used for other languages. Relations between the tables are presented in Figure 4. In this figure, each word (lemma) in the Word table has a wordno that links the word to all of its senses (or semantic information) in the Sense table. For example, the word have has 22 synsetno values in the Sense table. Each of these synsetno values has a definition in the SynSet table and a sample text in the Sample table. Other tables provide further semantic information for the word, including semantic relations and lexical relations.

Figure 5 shows the steps involved in finding synonyms of the word have based on the presented schema. Steps are marked with circles to show which tables provide the output for each step. Step 1 finds the wordno for the input word in the Word table. In step 2, the Sense table returns all the synsetno and tagcnt values for the wordno. The synsetno is the link to the meaning of the word in the SynSet table. That is why, in step 3, each synsetno is examined separately. In step 4, the SynSet table returns the definition of each synsetno along with a lexno, which relates the synset to the LexName table. In step 5, the LexName table gives a general sense of the word and whether it is related to people, plants, body parts or other general categories. Step 6 loops through all the synsetno values for a wordno in the Sense table. In step 7, the Sense table returns all the wordno values for a synsetno, and step 8 traces each wordno back to its lemma in the Word table. In step 9, the Sample table returns a sample of how the word is used in a sentence. For each of the senses, steps 4 through 9 are repeated.
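The core of this lookup can be sketched with SQLite as follows. The table names and the columns wordno, lemma, synsetno, tagcnt, definition and lexno are taken from the description above; the exact column types, the helper names create_schema and synonyms, and the omission of the LexName and Sample lookups (steps 5 and 9) are simplifications assumed for this example.

```python
import sqlite3


def create_schema(conn):
    """Create a reduced, assumed version of the lexical reference schema."""
    conn.executescript(
        """
        CREATE TABLE Word   (wordno INTEGER PRIMARY KEY, lemma TEXT NOT NULL);
        CREATE TABLE SynSet (synsetno INTEGER PRIMARY KEY, definition TEXT, lexno INTEGER);
        CREATE TABLE Sense  (wordno INTEGER, synsetno INTEGER, tagcnt INTEGER);
        """
    )


def synonyms(conn, lemma):
    """Return {synsetno: {"definition": str, "synonyms": [lemma, ...]}}."""
    # Step 1: find the wordno for the input lemma.
    row = conn.execute("SELECT wordno FROM Word WHERE lemma = ?", (lemma,)).fetchone()
    if row is None:
        return {}
    wordno = row[0]

    # Step 2: all senses (synsetno values) of that word.
    senses = conn.execute(
        "SELECT synsetno FROM Sense WHERE wordno = ? ORDER BY synsetno", (wordno,)
    ).fetchall()

    result = {}
    for (synsetno,) in senses:  # Steps 3 and 6: examine each sense in turn.
        # Step 4: the definition of this sense.
        definition = conn.execute(
            "SELECT definition FROM SynSet WHERE synsetno = ?", (synsetno,)
        ).fetchone()
        # Steps 7 and 8: other words sharing the synset, traced back to lemmas.
        others = conn.execute(
            "SELECT w.lemma FROM Sense s JOIN Word w ON w.wordno = s.wordno "
            "WHERE s.synsetno = ? AND s.wordno != ? ORDER BY w.lemma",
            (synsetno, wordno),
        ).fetchall()
        result[synsetno] = {
            "definition": definition[0] if definition else None,
            "synonyms": [other for (other,) in others],
        }
    return result
```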


Figure 4. Relational schema adopted for Lexical Reference based on wordnet2sql [1]


Figure 5. (a) and (b) are two iterations of the algorithm for finding synonyms of the word have

     2.3 Annotation Positioning Module

For stand-off annotation to be successful, it must be able to relate an annotation to the exact location that the user intended, even if the document goes through changes. Borrowing the idea from [17], we use three location descriptors, each with a corresponding reattachment algorithm that attaches an annotation back to its location. The first descriptor is a unique ID that the document provides. For this descriptor to work, every element of the document has to have a unique ID. The reattachment algorithm for this descriptor uses this ID to find the exact annotation location.

The second descriptor is the TreeWalk. Using the Document Object Model (DOM) [7], the start and end positions of the annotation are saved as the algorithm walks the tree structure of the document from the root down to the leaves. The reattachment algorithm uses the path information to find and mark the annotation location. As a complement to these methods, we use a context descriptor. Context is defined as the words surrounding the annotated text. To make sure the exact same location is marked, the saved context should match the context of the current location. The number of context words before and after the annotation location can affect the accuracy and efficiency of the reattachment algorithm.
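The context descriptor can be illustrated with a simple scoring scheme: among all occurrences of the annotated word, pick the one whose neighboring words best match the saved context. This is a sketch under stated assumptions rather than the algorithm of [17] or of OntoGloss: it handles single-word annotations only, counts exact word matches, and resolves ties in favor of the first occurrence. The function name find_anchor is hypothetical.

```python
def find_anchor(words, target, before, after):
    """Return the index of the occurrence of `target` that best fits the context.

    words  -- the current document as a list of words
    target -- the annotated word
    before -- saved context words immediately preceding the annotation
    after  -- saved context words immediately following the annotation
    Returns None if `target` does not occur in `words`.
    """
    best_index, best_score = None, -1
    for i, word in enumerate(words):
        if word != target:
            continue
        left = words[max(0, i - len(before)):i]
        right = words[i + 1:i + 1 + len(after)]
        # Compare context word by word, starting next to the annotation.
        score = sum(a == b for a, b in zip(reversed(left), reversed(before)))
        score += sum(a == b for a, b in zip(right, after))
        if score > best_score:
            best_index, best_score = i, score
    return best_index
```

Saving more context words makes a wrong match less likely but costs more comparisons per candidate, which is the accuracy and efficiency trade-off mentioned above.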


     2.4 User Interface

Figure 6 shows a snapshot of OntoGloss. In this figure, the word have is annotated as an instance of the class Verb in the GOLD ontology. As the figure shows, moving the mouse over the annotated text (which is marked with !!) displays the type of the annotated text. Any document, local or on the Web, can be annotated by highlighting text and using a drag-and-drop operation. Once a section of the text is selected (a morpheme, a word or a paragraph), the user can drag the section and drop it on a concept in the ontology. This creates an RDF triple of the form (Subject rdf:type Object) in the Annotation Database. Subject is the selected text and Object is the class of which the text is an instance.

Here is a sample of the OntoGloss output. For brevity, the URI before the # sign is replaced with Doc1 and Doc2.

While a word is selected, one can get its different synsets from WordNet through the Lexical Reference Interface.
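The (Subject rdf:type Object) triple created by a drag-and-drop annotation, as described in this section, might be built along the following lines. This is not the actual OntoGloss output format: the document URI, the fragment identifier w12 and the class URI are invented for illustration, and only the rdf:type URI is the standard one. The N-Triples-style line produced by to_ntriples is just one possible serialization.

```python
# Standard URI of rdf:type.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def annotation_triple(document_uri, fragment_id, class_uri):
    """Build (Subject, rdf:type, Object) for a selected piece of text.

    The subject is the document URI plus a fragment after the # sign that
    identifies the selected text (the fragment scheme is an assumption).
    """
    subject = f"{document_uri}#{fragment_id}"
    return (subject, RDF_TYPE, class_uri)


def to_ntriples(triple):
    """Serialize a triple of URIs as a single N-Triples-style line."""
    return " ".join(f"<{term}>" for term in triple) + " ."
```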


     3. Related Work

There are many text annotation tools available, including Amaya [11] from the W3C and KIM [18] from OntoText lab. Amaya provides RDF-based annotation, but it is limited to pieces of information about the Author, Type, Creator and Last modified date, or a text that the annotator provides. KIM is another general-purpose annotation tool that uses the KIM Ontology (KIMO) and a knowledge base of important general terms to automatically annotate a document. Although KIM's approach to using an ontology is similar to that of OntoGloss, the main differences are the ability of OntoGloss to use different ontologies and different versions of the same ontology, and the semi-automatic nature of OntoGloss, which is warranted for a scientific field that needs an expert's input (in this case, a linguist's input). Both KIM and OntoGloss use Sesame as their main RDF repository.

OntoAnnotate [22] is another text annotation tool. OntoAnnotate keeps a local copy of the document in the document management system along with the metadata that annotates the document. In our approach, documents stay where they are and we only keep the annotation triples in the Annotation Database. The other big difference is that in OntoGloss the annotation is at the morpheme level. In the future we can benefit from OntoAnnotate's extraction-based approach for semi-automatic annotation.

MnM [25] trains the system on a set of documents, learning from the initial manual annotation and subsequent information extraction methods. The result is a set of induced rules that can be used to extract information from text. The main difficulty in using MnM in the linguistic field is that, for most endangered languages, there are usually not many documents to learn from.


Figure 6. A snapshot of OntoGloss

OntoShare [6] provides an ontology-based annotation system to share resources among participants. Users annotate resources with RDF(S) based on a predefined ontology. These annotations are saved along with the user profile of the annotator and can be accessed by other users interested in the same resources. Each concept in the ontology is associated with a set of terms that are retrieved from the document by a ranking algorithm. These terms are used when the system matches shared information against user profiles at query time. OntoShare supports a degree of ontology evolution by modifying the set of terms associated with each concept in the ontology. In other words, the characterization of classes changes without any change to the ontology itself.


     4. Future Work

For linguistics, using ontology in applications (or in annotation) is a relatively new idea. It is imperative that the problem of ontology versioning is addressed in early stages before applications commit themselves to concepts from a particular version. In linguistics, like any other field, ontology might go through changes. These changes are due to any of the following reasons:

  • New discoveries in the field. In linguistics, new knowledge about languages, and especially endangered languages, is gathered every day. Newly discovered knowledge might be unique to a specific language but still force a general ontology to change so that it accommodates the new knowledge.
  • Change in the conceptualization. Experts tend to change their stance on concept definitions even when everything else stays the same. Something that is called a class in one version might be called a property in another version.
  • Change in the scope. An ontology developer might decide to expand the domain of the ontology; for example, a general linguistic ontology might be expanded to include knowledge about phonetics.
  • Imported ontologies. An ontology might be imported into other ontologies, and the imported ontology might change independently of the importing ontology. If we import an ontology of phonetics into a general linguistic ontology, any versioning in the phonetics ontology will force a change in the general ontology.

Figure 7 shows two versions of the GOLD ontology [8] in Protégé [24]. The highlighted item (SelfConnectedObject) is one of the nodes where a change has occurred. As the figure shows, among many other changes, in the new version the WrittenExpression class is added while OrthSentence is removed. In the new version, the Character class is a sibling of SymbolicString, while it used to be its child. Between two consecutive versions of GOLD, there were 156 changes in the definitions of classes/properties, 105 additions in the new ontology and 197 classes/properties that were removed. Retrieving annotations when the ontology is allowed to change is significantly harder than when the ontology is fixed. If a class is removed or changed, all the instances of that class might become inaccessible unless we devise a mechanism to access those instances. After the change, queries that used to retrieve instances would either not work or return false results. In the next phase of this project we are focusing on solving the ontology versioning problem.
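To make the problem concrete, the kinds of change listed above can be detected by comparing two versions of a class hierarchy, and the annotations that would become inaccessible can then be identified. The sketch below is only an illustration of the problem, not the versioning mechanism planned for OntoGloss. Each version is modeled as a mapping from class name to parent class name, and the function names diff_versions and orphaned_annotations are assumptions.

```python
def diff_versions(old, new):
    """Compare two hierarchies given as {class_name: parent_name_or_None}.

    Returns three sorted lists: classes added in `new`, classes removed
    from `old`, and classes present in both whose parent has changed.
    """
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    moved = sorted(c for c in set(old) & set(new) if old[c] != new[c])
    return added, removed, moved


def orphaned_annotations(annotations, removed):
    """Return the (text, class) annotations whose class no longer exists."""
    removed_set = set(removed)
    return [(text, cls) for text, cls in annotations if cls in removed_set]
```

With the changes described for Figure 7 (and a placeholder parent, since the actual parents are not given in the text), WrittenExpression would appear in the added list, OrthSentence in the removed list, and Character in the moved list.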


Figure 7. Two versions of GOLD ontology. Older version on the left (11/11/03) and newer version on the right (1/22/04)

References

1. Bergmair, R. Wordnet2sql. As seen on May 2005 at http://wordnet2sql.infocity.cjb.net/about-software.html
2. Berners-Lee, T., Hendler, J., and Lassila, O. (2001) The Semantic Web: A new form of Web content that is meaningful to computers will unleash a revolution of new possibilities. The Scientific American 284: 34-43.
3. Bird, S. and Liberman, M., A formal framework for linguistic annotation. Speech Communication 33(1,2), pp. 23-60, 2001.
4. Broekstra, J., Kampman, A. and Van Harmelen, F., Sesame: A generic architecture for storing and querying RDF and RDF schema. In the Proceedings of the 1st International Semantic Web Conference, Sardinia, Italia, June, 2002.
5. Chebotko, A., Deng, Y., Lu, S., Fotouhi, F. and Aristar, A. An Ontology-based Multimedia Annotator for the Semantic Web of Language Engineering, International Journal on Semantic Web and Information Systems, 1(1), pp. 50-67, January, 2005.
6. Davies, J., Duke, A. and Sure, Y., Ontoshare: A Knowledge Management Environment for Virtual Communities of Practice. In K-CAP 2003, Second International Conference on Knowledge Capture, Oct. 23-26, 2003, Florida, USA.
7. Document Object Model (DOM), http://www.w3.org/DOM/
8. Farrar, S. and Langendoen, T. A Linguistic Ontology for the Semantic Web, GLOT International 7(3), 97-100, 2003.
9. Gomez-Perez, A. and Corcho, O. Ontology languages for the Semantic Web. IEEE Intelligent Systems Vol. 17, No. 1, pp. 54-60, January/February, 2002.
10. Ide, N., Romary, L., de la Clergerie, E. (2003). International Standard for a Linguistic Annotation Framework. Proceedings of HLT-NAACL'03 Workshop on The Software Engineering and Architecture of Language Technology, Edmonton.
11. Kahan, J., Koivunen, M., Prud'Hommeaux, E. and Swick, R., Annotea: An Open RDF Infrastructure for Shared Web Annotations. In Proceedings of WWW10, Hong Kong, May 2001.
12. Klein, M., Fensel, D., Harmelen, F. and Horrocks, I. The Relation between Ontologies and XML Schemas, Linkoping Electronic Articles in Computer and Information Science, 6(4), (2001).
13. Lu, S., Dong, M. and Fotouhi, F. (2002) The Semantic Web: opportunities and challenges for next-generation Web applications. Information Research 7(4), Available at: http://InformationR.net/ir/7-4/paper134.html
14. Lu, S., Liu, D., Fotouhi, F., Dong, M., Reynolds, R., Aristar, A., Ratliff, M., Nathan, G., Tan, J. and Powell, R. Language Engineering for the Semantic Web: a Digital Library for Endangered Languages, International Journal of Information Research, 9(3), April 2004.
15. OntoWeb Consortium. Ontology-based information exchange for knowledge management and electronic commerce, IST-2000-29243. http://www.ontoweb.org, 2002.
16. OWL Web Ontology Language Overview. http://www.w3.org/TR/owl-features/.
17. Phelps, T. and Wilensky, R. Robust intra-document locations, Proceedings of the 9th international World Wide Web conference on Computer networks: the international journal of computer and telecommunications networking, pp. 105-118, 2000.
18. Popov, B., Kiryakov, A., Ognyanoff, D., Manov, D., Kirilov, A., Goranov, M., KIM - Semantic Annotation Platform. 2nd International Semantic Web Conference (ISWC2003), 20-23 October 2003, Florida, USA. LNAI Vol. 2870, pp. 484-499, Springer-Verlag Berlin Heidelberg 2003.
19. RDF Vocabulary Description Language 1.0: RDF Schema. http://www.w3.org/TR/rdf-schema/.
20. Resource Description Framework (RDF) http://www.w3.org/RDF/.
21. Staab, S., Handschuh, S., Madche, A. Metadata and the Semantic Web - and CREAM (Extended Abstract of Invited Talk). In Proceedings of the DELOS-2001 workshop. September 8-9, 2001, Darmstadt, ERCIM, 2001.
22. Staab, S., Maedche, A. and Handschuh, S., An Annotation Framework for the Semantic Web. Proc. 1 Int. Workshop on MultiMedia Annotation, Tokyo, 2001.
23. The Linguist's Shoebox. www.sil.org/computing/shoebox.
24. The Protégé project. http://protege.stanford.edu
25. Vargas-Vera, M., Motta, E., Domingue, J., Lanzoni, M., Stutt, A. and Ciravegna, F., MnM: Ontology Driven Semi-automatic and Automatic Support for Semantic Markup. In Proceedings of EKAW 2002.
26. WordNet: A lexical database for English. http://www.cogsci.princeton.edu/~wn/