Farhad Mostowfi, Farshad Fotouhi & Anthony Aristar,
Wayne State University

OntoGloss: An Ontology-based Annotation Tool

In this paper, we present OntoGloss, an ontology-based annotation tool that uses predefined concepts in an ontology to mark up a document. The difference between regular annotation and ontology-based annotation is that in the former, the annotation is plain text collected according to a fixed structure [3], while in the latter, the annotation is a set of instances of classes and relations drawn from the domain ontology. In ontology-based annotation, the annotation process consists of assigning the annotated text to a concept in the ontology (instantiating a class) or to a data type, or relating it to another annotated text (instantiating a relation). Such annotation is in line with the requirements suggested in [2] and provides expressive adequacy, semantic adequacy, incrementality, uniformity, openness, extensibility, human readability, processability (explicitness), and consistency.
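
The distinction between instantiating a class and instantiating a relation can be illustrated with a minimal Python sketch. It is not OntoGloss code: the type names `ClassInstance` and `RelationInstance`, the concept labels `Noun` and `PluralMarker`, the relation `hasPart`, and the `example.org` URI are all hypothetical, chosen only to show the shape of the data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassInstance:
    """A text span assigned to an ontology concept (hypothetical model)."""
    doc_uri: str   # document the span belongs to
    start: int     # character offset where the span begins
    end: int       # character offset just past the span
    concept: str   # name of the ontology class being instantiated


@dataclass(frozen=True)
class RelationInstance:
    """A relation that links two annotated spans (hypothetical model)."""
    subject: ClassInstance
    relation: str
    obj: ClassInstance


def covered_text(text: str, inst: ClassInstance) -> str:
    """Return the part of the document text that an instance covers."""
    return text[inst.start:inst.end]


text = "The dogs barked."
doc = "http://example.org/doc1"  # illustrative URI

# Instantiating classes: a word-level span and a morpheme-level span.
noun = ClassInstance(doc, 4, 8, "Noun")
plural = ClassInstance(doc, 7, 8, "PluralMarker")

# Instantiating a relation: relate one annotated span to another.
link = RelationInstance(noun, "hasPart", plural)

print(covered_text(text, noun), covered_text(text, plural), link.relation)
```

Because the spans are stored as offsets rather than embedded markup, the same document can carry annotations at several granularity levels at once.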

OntoGloss is a stand-off annotator that annotates documents at every level of granularity, from the document level down to the morpheme level. Its web interface and drag-and-drop functionality let the user browse any textual document and easily annotate it with classes from the available ontologies. Annotated data is exported in RDF, a data model for metadata with an XML-based serialization. RDF data can be loaded into an RDF repository for querying and retrieval. Using RDF as the main storage and exchange format makes knowledge in the field portable to other applications and readable by machines as well as by humans. Each annotated document can be linked to a language code, so that one can extract all material on a particular language.
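
As a rough illustration of what exported annotation data might look like, the following Python sketch writes a few triples in the N-Triples syntax. This is an assumption made for brevity: the abstract describes an XML-based RDF export, and the predicate URIs under `example.org`, the `Literal` helper, and the language code `fas` are hypothetical. Only the `rdf:type` URI is a real, standard identifier.

```python
from typing import Iterable, NamedTuple, Tuple, Union


class Literal(NamedTuple):
    """A plain string value, as opposed to a URI (hypothetical helper)."""
    value: str


Term = Union[str, Literal]  # a bare str is treated as a URI

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def format_term(term: Term) -> str:
    """Render a URI as <...> and a literal as a quoted, escaped string."""
    if isinstance(term, Literal):
        escaped = term.value.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    return f"<{term}>"


def to_ntriples(triples: Iterable[Tuple[str, str, Term]]) -> str:
    """Serialize (subject, predicate, object) triples, one per line."""
    return "\n".join(
        f"{format_term(s)} {format_term(p)} {format_term(o)} ."
        for s, p, o in triples
    )


ANN = "http://example.org/annotation/1"  # illustrative annotation URI
triples = [
    (ANN, "http://example.org/schema#annotates", "http://example.org/doc1"),
    (ANN, RDF_TYPE, "http://example.org/gold#Noun"),
    # Linking the annotation to a language code allows retrieval by language.
    (ANN, "http://example.org/schema#languageCode", Literal("fas")),
]

out = to_ntriples(triples)
print(out)
```

Output in this form could then be loaded into any RDF repository that accepts N-Triples and queried by language code or by ontology class.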

OntoGloss uses the linguistic knowledge gathered through annotation by the community to automatically annotate other documents. For any annotated page, a set of RDF triples is created and saved in the database. On the next visit to the same document, OntoGloss retrieves all the triples for the page from the database and marks all the annotated sections. As long as the structure of the document does not change dramatically (which is usually the case in linguistics), this recreates the same annotated sections. OntoGloss uses Uniform Resource Identifiers (URIs) to identify resources and to represent relations between them. It keeps annotations separate from the actual documents and supports two modes of operation: local and remote. In the local mode, annotated data is saved locally and is used in annotating documents that are visited for the first time. In the remote, or shared annotation server, mode, linguists can add their annotated data to a server for the community to use.
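
The revisit behaviour described above can be sketched in Python as a store keyed by document URI. This is a simplified illustration, not the OntoGloss implementation: the class `AnnotationStore`, its methods, and the in-memory dictionary standing in for the database are hypothetical. The fallback that searches for the quoted text when offsets have shifted is likewise an assumption, added to show one way small structural changes could be tolerated.

```python
from collections import defaultdict
from typing import List, Tuple


class AnnotationStore:
    """In-memory stand-in for the annotation database (hypothetical)."""

    def __init__(self) -> None:
        # doc URI -> list of (start, end, concept, quoted text)
        self._by_doc = defaultdict(list)

    def add(self, doc_uri: str, text: str, start: int, end: int,
            concept: str) -> None:
        """Record a stand-off annotation; the document is left untouched."""
        self._by_doc[doc_uri].append((start, end, concept, text[start:end]))

    def reapply(self, doc_uri: str, text: str) -> List[Tuple[int, int, str]]:
        """Return the spans to mark when the document is visited again."""
        marked = []
        for start, end, concept, quoted in self._by_doc.get(doc_uri, []):
            if text[start:end] != quoted:
                # Offsets no longer match: look for the first occurrence
                # of the originally annotated text instead (assumption).
                start = text.find(quoted)
                if start == -1:
                    continue  # the annotated text is gone; skip it
                end = start + len(quoted)
            marked.append((start, end, concept))
        return marked


uri = "http://example.org/doc1"  # illustrative URI
store = AnnotationStore()
store.add(uri, "The dogs barked.", 4, 8, "Noun")

same = store.reapply(uri, "The dogs barked.")
shifted = store.reapply(uri, "All the dogs barked.")
unknown = store.reapply("http://example.org/other", "The dogs barked.")
print(same, shifted, unknown)
```

In the shared-server mode, the same interface could be backed by a remote repository rather than a local dictionary, so that annotations contributed by one linguist become available to others.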

In this paper, we show how we used the GOLD ontology [1] to annotate documents with OntoGloss, and how the annotated data is loaded into an RDF repository for querying and retrieval. We then explain auto-annotation and show how a lexical reference system such as WordNet can be linked to OntoGloss to facilitate the annotation process.

[1] Farrar, S. and Langendoen, T. (2003). A Linguistic Ontology for the Semantic Web. GLOT International 7(3), 97-100.
[2] Ide, N., Romary, L., and de la Clergerie, E. (2003). International Standard for a Linguistic Annotation Framework. Proceedings of the HLT-NAACL'03 Workshop on the Software Engineering and Architecture of Language Technology, Edmonton.
[3] Staab, S., Handschuh, S., and Madche, A. (2001). Metadata and the Semantic Web - and CREAM (Extended Abstract of Invited Talk). In Proceedings of the DELOS-2001 Workshop, September 8-9, 2001, Darmstadt. ERCIM.