OntoGloss: An Ontology-Based Annotation Tool
|
|
|
(Department of Computer Science, Wayne State University, Detroit, Michigan) |
||
|
OntoGloss is an ontology-based annotation tool that uses predefined concepts in an ontology to mark up a document. The difference between regular annotation and ontology-based annotation is that in the former, the annotation is plain text collected according to a fixed structure [21], while in the latter, the annotation is a set of instances of classes and relations based on the domain ontology. In ontology-based annotation, the annotation process assigns the annotated text to a concept in the ontology (instantiating a class) or to a data type, or relates it to another annotated text (instantiating a relation). Such annotation is in line with the requirements suggested in [10] and provides:
According to Bird’s definition [3], “Linguistic annotation covers any descriptive or analytic notations applied to raw language data”. For linguists, marking up a document is a way of preserving its content. This is more urgent for languages that are in danger of disappearing [14, 5]. Endangered languages can benefit tremendously from ontology-based annotations that explicitly express the semantics of the content. An ontology, as a way of formalizing knowledge, can help linguists resolve the incompatibility of markup data in a multilingual annotation and search environment. An ontology captures the knowledge in a field in a generic form so that it can be understood, shared and reused by the community. Later, this knowledge can be used to automatically annotate morphemes, words or phrases in other documents.
Ontologies have been developed to share knowledge “between people and heterogeneous and distributed systems” [12]. They are used in Knowledge Management, E-Commerce, Natural Language Applications, Intelligent Information Integration, Information Extraction and Information Retrieval [15]. By formalizing the terminology and the relations between concepts in a field, ontologies make integration between different sources of information possible. An ontology usually covers a whole domain or sub-domain rather than a single application. Once experts develop an ontology for a domain, it becomes a resource for everybody else to use. Ontologies are emerging in different areas, and with the advent of ontology languages like OWL [16] it is becoming easier to develop one from scratch or to use existing ontologies as a starting point for new ones.
OntoGloss uses the linguistic knowledge gathered through annotation by the community to automatically annotate other documents. For any annotated document, a set of RDF [20] triples is created and saved in the database.
On the next visit to the same document, OntoGloss retrieves all the triples for the document from the database and marks all the annotated sections. As long as the structure of the document does not change dramatically (which is usually the case in linguistics), this recreates the same annotated sections. OntoGloss uses Uniform Resource Identifiers (URIs) [20] to identify resources and to represent relations between them. It keeps annotations separate from the actual documents and supports two modes of operation: local and remote. In the local mode, annotated data is saved locally and is used in annotating documents that are visited for the first time. In the remote, or shared annotation server, mode, a linguist can add his or her annotated data to a server for the community to use.
OntoGloss has the following features:
When a document is visited for the first time, OntoGloss compares each word with all the annotated text in the database and assigns the same type of annotation to matching words. This serves as an initial suggestion and can be changed by the linguist if needed. Classes in the ontology are color-coded, and an annotated text has the same color as the class used to annotate it. This gives the linguist a visual clue to the type of markup.
There are many text annotators available, both as open source and as commercial products. What is different about a linguistic annotator is that words in linguistics are broken up into morphemes. OntoGloss is able to annotate morphemes in a word. For example, if xxxabc is composed of xxx with a suffix -abc, a linguist using OntoGloss is able to annotate each morpheme separately. In the automatic annotation of new documents, when OntoGloss finds yyyabc, it can determine whether it has the same suffix [23] and annotate it with the same class in the ontology.
In section 2, we begin by introducing the components of OntoGloss. After an introduction to ontology languages, we go into more detail on a few modules, namely the Ontology Management Interface, Lexical Reference Interface, Annotation Positioning and User Interface. In section 3 we look at related work, and in section 4 we look at some ideas for future work.
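The suffix-based suggestion for words such as yyyabc can be illustrated with a minimal Python sketch. It is not the OntoGloss implementation: the function name, the dictionary of known suffixes and the class name "Suffix" are assumptions made only for illustration, and plain string matching stands in for whatever morphological analysis [23] provides.

```python
from __future__ import annotations

from typing import Optional

# Hypothetical store of previously annotated suffix morphemes and the
# ontology class each was annotated with ("Suffix" is a placeholder name).
annotated_suffixes = {"abc": "Suffix"}


def suggest_suffix_annotation(
    word: str, suffixes: dict[str, str]
) -> Optional[tuple[str, str, str]]:
    """Return (stem, suffix, ontology_class) for the longest known suffix."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        # Require a non-empty stem so that a bare suffix is not split.
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)], suffix, suffixes[suffix]
    return None


suggestion = suggest_suffix_annotation("yyyabc", annotated_suffixes)
print(suggestion)
```

The result is only a suggestion; in the workflow described above, the linguist remains free to change it.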
|
|
Figure 1 shows the OntoGloss architecture. In this figure: |
|
|
An ontology is made up of a set of concepts in a domain together with their attributes and relations. There are also constraints, axioms and other constructs that represent the general knowledge in the domain. Concepts or classes (either physical or abstract) are the basic building blocks. Everything else in the ontology is meant to represent knowledge about these concepts. This knowledge might be just a concept's attributes, or it might be more elaborate, such as cardinality constraints on the properties of concepts that explain how classes are related to each other and to other entities in the world. Relational properties are binary relations between two concepts. They might be symmetric, transitive or both. A relation is symmetric if, whenever it holds from A to B, it also holds from B to A. A relation is transitive if the relation between A and B and the relation between B and C imply that the same relation holds between A and C. An inverse relational property is the inverse of a relation, such as isParent and its inverse isChild. A concept hierarchy is a taxonomy that organizes concepts in generalization and specialization relationships [9]. In what follows, we give a brief introduction to RDF [20], RDF Schema [19] and OWL [16] and then introduce the schema that we have chosen to represent the ontology.
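The symmetric, transitive and inverse properties defined above can be made concrete with a short Python sketch that treats a relation as a set of (A, B) pairs. The function names and the sample individuals are illustrative assumptions, not part of OntoGloss.

```python
def symmetric_closure(pairs):
    """Add (B, A) for every (A, B) so the relation holds in both directions."""
    return set(pairs) | {(b, a) for a, b in pairs}


def transitive_closure(pairs):
    """Repeatedly add (A, C) whenever (A, B) and (B, C) are both present."""
    closure = set(pairs)
    while True:
        new = {(a, d) for a, b in closure for c, d in closure if b == c} - closure
        if not new:
            return closure
        closure |= new


def inverse(pairs):
    """Swap subject and object, e.g. isParent becomes isChild."""
    return {(b, a) for a, b in pairs}


# Hypothetical individuals, used only for illustration.
is_parent = {("Ann", "Bob")}
is_child = inverse(is_parent)
is_ancestor = transitive_closure({("A", "B"), ("B", "C")})
print(is_child, ("A", "C") in is_ancestor)
```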
|
|
RDF [20] is a data model for metadata, based on XML. It uses Uniform Resource Identifiers (URIs) to identify resources and represents relations between resources in the domain in a machine-understandable form. To express these relations it uses triples of the form (Subject, Predicate, Object), which can be represented as a directed graph with Subject and Object as nodes and Predicate as the edge. It also adds to the semantic content by using containers and reification (statements about statements).
RDF Schema [19] is a language for expressing concepts, relations between concepts, and their attributes and constraints. It is a semantic extension of RDF with the added features of reasoning and advanced search. Unlike RDF, in RDF Schema classes and properties can be used to describe other classes and properties. RDF Schema is very expressive, but it still has many shortcomings. Among them is the lack of cardinality constraints, which put limits on the minimum and maximum number of values that a property might have. It is also not able to express transitivity, uniqueness, equivalence, union, intersection and disjointness. These issues have been addressed in the OWL language.
OWL [16] is capable of conveying more semantics and meaning than XML, RDF or RDF Schema. It is the latest language (after DAML+OIL) added to the family of ontology languages by the W3C. Because OWL supports reasoning, inference might be undecidable even for a simple set of rules. That is why OWL comes as a layered language with three layers. OWL Full is a semantic and syntactic extension of RDF and RDF Schema and is likely to be undecidable. OWL DL is a decidable version of OWL Full with a friendlier syntax based on description logic. The third is OWL Lite, which is a subset of OWL DL and is more tractable than the other two. OWL covers the following constructs from RDF Schema: rdfs:Class, rdf:Property, rdfs:subClassOf, rdfs:subPropertyOf, rdfs:domain, rdfs:range and rdf:type.
In OWL, two classes or two properties can be declared as synonyms (equivalentClass or equivalentProperty). Instances can be declared equivalent in the same way. If two classes are equivalent, any instance that belongs to one also belongs to the other. The same is true of two properties that are related through equivalentProperty: they both relate an instance to the same set of instances. There are also differentFrom and allDifferent constructs. The former states that two instances are different and the latter states that all the listed instances are different. InverseOf, TransitiveProperty, SymmetricProperty, FunctionalProperty and InverseFunctionalProperty describe different types of properties. If two properties are inverses of each other, this is expressed with the InverseOf relation. A FunctionalProperty has a unique value, which means its cardinality is either zero or one. If the inverse of the property is functional, then InverseFunctionalProperty is used, which is like a unique key in the relational model. MinCardinality, maxCardinality and Cardinality specify the minimum, maximum and exact number of values of a property that an instance of a class is related to. IntersectionOf states the intersection of classes. OWL DL and OWL Full have other constructs in addition to those explained above. These are: class axioms like oneOf and disjointWith; Boolean combinations like unionOf, intersectionOf and complementOf; and arbitrary cardinality and filler information like hasValue.
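As a rough illustration of the triple model and of equivalentClass, the following Python sketch stores triples as tuples, views them as a directed graph, and derives the extra rdf:type statements implied by class equivalence. The ex: names are hypothetical, and the inference handles only direct equivalences; it is a sketch of the idea, not an OWL reasoner.

```python
from collections import defaultdict

# (Subject, Predicate, Object) triples; the ex: names are placeholders.
triples = [
    ("ex:have", "rdf:type", "ex:Verb"),
    ("ex:Verb", "owl:equivalentClass", "ex:Verbal"),
]


def as_graph(triples):
    """Directed graph: each subject maps to its (predicate, object) edges."""
    graph = defaultdict(list)
    for s, p, o in triples:
        graph[s].append((p, o))
    return dict(graph)


def infer_types(triples):
    """If C1 is equivalent to C2, every instance of one is an instance of the other."""
    equivalents = defaultdict(set)
    for s, p, o in triples:
        if p == "owl:equivalentClass":
            equivalents[s].add(o)
            equivalents[o].add(s)
    inferred = set(triples)
    for s, p, o in triples:
        if p == "rdf:type":
            for other in equivalents[o]:
                inferred.add((s, "rdf:type", other))
    return inferred


print(as_graph(triples))
print(sorted(infer_types(triples)))
```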
|
|
In OntoGloss, for each construct in the ontology, there is a table that stores all the relevant information about that construct, plus information such as the version, the current status and the original ontology that defined the construct. Figure 2 shows the part of the schema that relates the Class and SubClassOf tables. Table 1 and Table 2 show these two tables populated with the part of the GOLD ontology [8] shown in Figure 3.
Figure 2. Relations between Class and SubClassOf tables
Table 1. Class Table
Table 2. SubClassOf Table
Figure 3. Part of the GOLD ontology
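Because the figures and tables are not reproduced here, the following Python sketch shows one way a Class table and a SubClassOf table could be laid out in a relational database. Only the table names and the idea of storing version, status and originating ontology come from the text; every column name is an assumption, and the single subclass row reflects the older GOLD version in which Character is a child of SymbolicString.

```python
import sqlite3

onto_db = sqlite3.connect(":memory:")
onto_db.executescript("""
CREATE TABLE Class (
    class_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL,
    version  TEXT NOT NULL,   -- ontology version that defines the class
    status   TEXT NOT NULL,   -- current status of the construct
    ontology TEXT NOT NULL    -- original ontology that defined it
);
CREATE TABLE SubClassOf (
    child_id  INTEGER NOT NULL REFERENCES Class(class_id),
    parent_id INTEGER NOT NULL REFERENCES Class(class_id),
    version   TEXT NOT NULL,
    PRIMARY KEY (child_id, parent_id, version)
);
""")

# Illustrative rows only; identifiers and status values are made up.
onto_db.executemany(
    "INSERT INTO Class VALUES (?, ?, ?, ?, ?)",
    [
        (1, "SymbolicString", "2003-11-11", "active", "GOLD"),
        (2, "Character", "2003-11-11", "active", "GOLD"),
    ],
)
onto_db.execute("INSERT INTO SubClassOf VALUES (2, 1, '2003-11-11')")


def superclasses(db, class_name, version):
    """Return the direct superclasses of a class in a given ontology version."""
    rows = db.execute(
        """
        SELECT parent.name
        FROM SubClassOf AS s
        JOIN Class AS child  ON child.class_id  = s.child_id
        JOIN Class AS parent ON parent.class_id = s.parent_id
        WHERE child.name = ? AND s.version = ?
        """,
        (class_name, version),
    ).fetchall()
    return [row[0] for row in rows]


print(superclasses(onto_db, "Character", "2003-11-11"))
```

Keeping the version on each row is one possible way to let several versions of the same ontology coexist in the same tables.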
|
|
|
Annotation tools can benefit from lexical references that provide the user with the semantics of a word. This includes synonyms, meronyms (part of), hypernyms and hyponyms (is a kind of). The better the user knows the word, the better he or she is able to annotate it with ontology concepts. WordNet is the best-known lexical reference system for the English language. Through the Lexical Reference Interface, OntoGloss is able to link to WordNet or other lexical reference systems. In the future, for languages that do not have a lexical reference (especially endangered languages), the linguist will be able to add to or update the lexicon during the annotation process.
Wordnet2sql [1] converts WordNet into a set of tables that can be used in any RDBMS. Our position is that the same schema or a subset of it can be used for other languages. Relations between the tables are presented in Figure 4. In this figure, each word (lemma) in the Word table has a wordno that links the word to all of its senses (or semantic information) in the Sense table. For example, the word have has 22 synsetno values in the Sense table. Each synsetno has a definition in the SynSet table and a sample text in the Sample table. Other tables provide further semantic information for the word, including semantic relations and lexical relations.
Figure 5 shows the steps involved in finding synonyms of the word have based on the presented schema. Steps are marked with circles to show which tables provide the output for each step. Step 1 finds the wordno of the input word in the Word table. In step 2, the Sense table returns all the synsetno and tagcnt values for the wordno. The synsetno is the link to the meaning of the word in the SynSet table. That is why in step 3, each synsetno is examined separately. In step 4, the SynSet table returns the definition of each synsetno along with a lexno, which relates the synset to the LexName table. In step 5, the LexName table gives a general sense of the word and whether it is related to people, plants, body parts or other general categories. Step 6 loops through all the synsetno values for a wordno in the Sense table. In step 7, the Sense table returns all the wordno values for a synsetno, and step 8 traces each wordno back to its lemma in the Word table. In step 9, the Sample table returns a sample of how the word is used in a sentence. For each of the senses, steps 4 through 9 are repeated.
Figure 4. Relational schema adopted for Lexical Reference based on wordnet2sql [1]
Figure 5. (a) and (b) are two iterations of the algorithm for finding synonyms of the word have
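The core of the synonym lookup (steps 1, 2, 4, 7 and 8) can be sketched as a single SQL join over the Word, Sense and SynSet tables. The sketch below assumes only the columns named in the text (wordno, lemma, synsetno, tagcnt, definition, lexno); the rows are toy data, not an extract of WordNet, and the LexName and Sample lookups are omitted.

```python
import sqlite3

lex_db = sqlite3.connect(":memory:")
lex_db.executescript("""
CREATE TABLE Word   (wordno INTEGER PRIMARY KEY, lemma TEXT);
CREATE TABLE Sense  (wordno INTEGER, synsetno INTEGER, tagcnt INTEGER);
CREATE TABLE SynSet (synsetno INTEGER PRIMARY KEY, definition TEXT, lexno INTEGER);
""")

# Toy rows for illustration only.
lex_db.executemany("INSERT INTO Word VALUES (?, ?)",
                   [(1, "have"), (2, "possess"), (3, "own"), (4, "consume")])
lex_db.executemany("INSERT INTO Sense VALUES (?, ?, ?)",
                   [(1, 100, 10), (2, 100, 3), (3, 100, 2), (1, 200, 5), (4, 200, 1)])
lex_db.executemany("INSERT INTO SynSet VALUES (?, ?, ?)",
                   [(100, "toy definition A", 1), (200, "toy definition B", 2)])


def synonyms(db, lemma):
    """Group the other lemmas sharing a synset with `lemma`, keyed by synsetno."""
    rows = db.execute(
        """
        SELECT s1.synsetno, ss.definition, w2.lemma
        FROM Word   AS w1
        JOIN Sense  AS s1 ON s1.wordno   = w1.wordno    -- steps 1 and 2
        JOIN SynSet AS ss ON ss.synsetno = s1.synsetno  -- step 4
        JOIN Sense  AS s2 ON s2.synsetno = s1.synsetno  -- step 7
        JOIN Word   AS w2 ON w2.wordno   = s2.wordno    -- step 8
        WHERE w1.lemma = ? AND w2.wordno <> w1.wordno
        ORDER BY s1.synsetno, w2.lemma
        """,
        (lemma,),
    ).fetchall()
    result = {}
    for synsetno, definition, synonym in rows:
        entry = result.setdefault(synsetno, {"definition": definition, "synonyms": []})
        entry["synonyms"].append(synonym)
    return result


print(synonyms(lex_db, "have"))
```

A synset in which the word has no other members is dropped by the inner join; a left join would be needed to keep such senses.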
|
|
|
For stand-off annotation to be successful, it must relate each annotation to the exact location that the user intended, even if the document goes through changes. Borrowing the idea from [17], we use three location descriptors, each with a corresponding reattachment algorithm that attaches the annotation back to its location. The first descriptor is a unique ID that the document provides. For this descriptor to work, every element of the document has to have a unique ID. The reattachment algorithm for this descriptor uses the ID to find the exact annotation location. The second descriptor is the TreeWalk. Using the Document Object Model (DOM) [7], the start and end positions of the annotation are saved as the algorithm walks the tree structure of the document from the root down to the leaves. The reattachment algorithm uses the path information to find and mark the annotation location. As a complement to these methods, we use a context descriptor. Context is defined as the words surrounding the annotated text. To make sure the exact same location is marked, the saved context should match the context of the current location. The number of context words before and after the annotation location can affect the accuracy and efficiency of the reattachment algorithm.
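The context descriptor can be illustrated with a small Python sketch that works on whitespace-separated words. The function names, the dictionary layout and the scoring rule (count matching context words and keep the best candidate) are assumptions for illustration; the ID and TreeWalk descriptors, and the algorithms of [17], are not shown.

```python
def make_context_descriptor(words, start, end, window=2):
    """Save the annotated words plus `window` words of context on each side."""
    return {
        "text": words[start:end],
        "before": words[max(0, start - window):start],
        "after": words[end:end + window],
    }


def reattach(words, descriptor):
    """Return the (start, end) span whose surrounding words best match the saved context."""
    text = descriptor["text"]
    n = len(text)
    best, best_score = None, -1
    for i in range(len(words) - n + 1):
        if words[i:i + n] != text:
            continue
        before = words[max(0, i - len(descriptor["before"])):i]
        after = words[i + n:i + n + len(descriptor["after"])]
        # Compare the preceding context from the annotation outwards.
        score = sum(a == b for a, b in zip(reversed(before), reversed(descriptor["before"])))
        score += sum(a == b for a, b in zip(after, descriptor["after"]))
        if score > best_score:
            best, best_score = (i, i + n), score
    return best


original = "we have a dog and they have two cats".split()
descriptor = make_context_descriptor(original, 6, 7)  # the second "have"
edited = "today we have a dog and they have two cats".split()
print(reattach(edited, descriptor))
```

A larger window makes the match more selective when the annotated text occurs several times, at the cost of more comparisons per candidate and more sensitivity to edits near the annotation.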
|
|
Figure 6 shows a snapshot of OntoGloss. In this figure, the word have is annotated as an instance of the class Verb in the GOLD ontology. As the figure shows, moving the mouse over the annotated text (which is marked with !!) displays the type of the annotated text. Any document, local or on the Web, can be annotated by highlighting text and using a drag-and-drop operation. Once a section of the text is selected (a morpheme, a word or a paragraph), the user can drag the section and drop it on a concept in the ontology. This creates an RDF triple of the form (Subject rdf:type Object) in the Annotation Database. Subject is the selected text and Object is the class of which the text is an instance. Here is a sample of the OntoGloss output. For brevity, the URI before the # sign is replaced with Doc1 and Doc2.
![]()
While a word is selected, one can get its different synsets from WordNet through the Lexical Reference Interface.
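To make the (Subject rdf:type Object) form concrete, the Python sketch below builds such a triple for a selected piece of text and retrieves all triples for a document, as happens when a document is revisited. It does not reproduce the actual OntoGloss output: the example.org URIs, the fragment identifier "w6" and the in-memory set standing in for the Annotation Database are all assumptions.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def annotate(store, doc_uri, fragment, class_uri):
    """Record that the text at doc_uri#fragment is an instance of class_uri."""
    triple = (f"{doc_uri}#{fragment}", RDF_TYPE, class_uri)
    store.add(triple)
    return triple


def annotations_for(store, doc_uri):
    """Return every triple whose subject lies inside the given document."""
    prefix = doc_uri + "#"
    return sorted(t for t in store if t[0].startswith(prefix))


# Hypothetical document and class URIs, used only for illustration.
annotation_store = set()
annotate(annotation_store, "http://example.org/Doc1", "w6", "http://example.org/gold#Verb")
print(annotations_for(annotation_store, "http://example.org/Doc1"))
```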
|
|
There are many text annotation tools available, including Amaya [11] from the W3C and KIM [18] from the OntoText lab. Amaya provides RDF-based annotation, but it is limited to pieces of information about the Author, Type, Creator and Last modified date, or a text that the annotator provides. KIM is another general-purpose annotation tool that uses the KIM Ontology (KIMO) and a knowledge base of generally important terms to automatically annotate a document. Although KIM's approach to using an ontology is similar to that of OntoGloss, the main difference is the ability of OntoGloss to use different ontologies and different versions of the same ontology, plus the semi-automatic nature of OntoGloss, which is warranted for a scientific field that needs an expert's input (in this case, a linguist's input). Both KIM and OntoGloss use Sesame as their main RDF repository. OntoAnnotate [22] is another text annotation tool. OntoAnnotate keeps a local copy of the document in the document management system along with the metadata that annotates the document. In our approach, documents stay where they are and we keep only the annotation triples in the Annotation Database. The other big difference is that in OntoGloss the annotation is at the morpheme level. In the future we can benefit from OntoAnnotate's extraction-based approach for semi-automatic annotation. MnM [25] trains the system on a set of documents, learning from the initial manual annotation and subsequent Information Extraction methods. The result is a set of induced rules that can be used to extract information from text. The main difficulty in using MnM for the linguistic field is that for most endangered languages there are usually not many documents to learn from.
Figure 6. A snapshot of OntoGloss
OntoShare [6] provides an ontology-based annotation system for sharing resources among participants. Users annotate resources with RDF(S) based on a predefined ontology. These annotations are saved along with the user profile of the annotator and can be accessed by other users interested in the same resources. Each concept in the ontology is associated with a set of terms that are retrieved from the document by a ranking algorithm. These terms are used when the system matches shared information against user profiles at query time. OntoShare supports a degree of ontology evolution by modifying the set of terms associated with each concept in the ontology. In other words, the characterization of classes changes without any change to the ontology itself.
|
|
For linguistics, using an ontology in applications (or in annotation) is a relatively new idea. It is imperative that the problem of ontology versioning is addressed in the early stages, before applications commit themselves to concepts from a particular version. In linguistics, as in any other field, an ontology might go through changes. These changes are due to any of the following reasons:
- New discoveries in the field. In linguistics, new knowledge about languages, and especially endangered languages, is gathered every day. Newly discovered knowledge might be unique to a specific language but still force a general ontology to change so that it accommodates the new knowledge.
- Changes in the conceptualization. Experts tend to change their stands on concept definitions even when everything else is the same. Something that is called a class in one version might be called a property in another version.
- Changes in the scope. The ontology developer might decide to expand the domain of the ontology; for example, a general linguistic ontology might be expanded to include knowledge about phonetics.
- Imported ontologies. An ontology might be imported into other ontologies. The imported ontology might change independently of the importing ontology. If we import an ontology of phonetics into a general linguistic ontology, any new version of the phonetics ontology will force a change in the general ontology.
Figure 7 shows two versions of the GOLD ontology [8] in Protégé [24]. The highlighted item (SelfConnectedObject) is one of the nodes where a change has occurred. As the figure shows, among many other changes, in the new version the WrittenExpression class is added while OrthSentence is removed. In the new version, the Character class is a sibling of SymbolicString, while it used to be its child. Between two consecutive versions of GOLD, there were 156 changes in the definitions of classes/properties, 105 additions in the new ontology and 197 classes/properties that were removed.
Retrieving annotations when changes to the ontology are allowed is significantly harder than when the ontology is fixed. If a class is removed or changed, all the instances of that class might become inaccessible unless we devise a mechanism to access those instances. After the change, queries that used to retrieve instances would either fail or return incorrect results. In the next phase of this project we will focus on solving the ontology versioning problem.
Figure 7. Two versions of GOLD ontology. Older version on the left (11/11/03) and newer version on the right (1/22/04)
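The kind of comparison summarized above (additions, removals and changed definitions between two versions) can be sketched in Python by representing each version as a mapping from class name to parent class. The three classes follow the changes named in the text; the parent name "P" is a placeholder because the actual parent is not stated, and the helper for spotting annotations that refer to removed classes is a hypothetical illustration rather than the versioning mechanism planned for OntoGloss.

```python
def diff_versions(old, new):
    """Compare two versions given as {class name: parent name} mappings."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(c for c in set(old) & set(new) if old[c] != new[c])
    return {"added": added, "removed": removed, "changed": changed}


def orphaned_annotations(annotations, diff):
    """Annotations (subject, class name) whose class no longer exists."""
    removed = set(diff["removed"])
    return [a for a in annotations if a[1] in removed]


# "P" is a placeholder parent; only the three changes below come from the text.
old_version = {"SymbolicString": "P", "Character": "SymbolicString", "OrthSentence": "P"}
new_version = {"SymbolicString": "P", "Character": "P", "WrittenExpression": "P"}

diff = diff_versions(old_version, new_version)
print(diff)
print(orphaned_annotations([("Doc1#s1", "OrthSentence"), ("Doc1#c1", "Character")], diff))
```

Annotations flagged this way are the instances that would become inaccessible to queries written against the new version.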
|
|
|
1. Bergmair, R. Wordnet2sql. As seen in May 2005 at http://wordnet2sql.infocity.cjb.net/about-software.html