Baden Hughes and David Nash, University of Melbourne


Roadtesting GOLD in Shoebox/Toolbox

GOLD [1] is an emerging linguistic ontology and data category registry for morphosyntactic annotation of human language data. GOLD is intended to capture the knowledge of a well-trained linguist; it is an attempt to codify the general knowledge of the field.

Given that linguists already use various types of software to perform such annotation, the intersection between GOLD and these existing tools is an important area for research. While first-generation ontologically aware tools are emerging, well-entrenched software will remain in common use for the foreseeable future, despite being "ontologically challenged" as far as linguistic markup is concerned.

The motivation of this paper is thus threefold. First, we consider the benefits of, and complexity in, reducing GOLD to a set of usable terms and formats for direct embedding in tools currently used in descriptive practice. Second, we evaluate how well the linguistic units and categories in GOLD apply to real linguistic data. Third, we review methods for forward migration of legacy data into ontologically aware markup.

In this paper we consider the case of Shoebox [2] / Toolbox [3], a widely adopted linguistic analysis framework for preparing lexicons and interlinear analyses in language documentation and description. Internally, Shoebox and Toolbox both allow unrestricted user specification of annotation labels, at both the field level and the morphosyntactic unit level.

We integrate GOLD into Shoebox/Toolbox by first creating a template based on the lexical [5] and interlinear [4] models from EMELD, and then implementing GOLD as the set of categories encoded as a Shoebox/Toolbox range set. We apply this template to data from the Australian languages Kayardild (using the Shoebox lexicon supplied to EMELD [6]) and Ganggalida (using data from ASEDA [7]). We also support the MDF Shoebox template and a template developed for Australian language dictionaries in a similar fashion. The approach is in line with work on best practice in the creation of Shoebox resources [8].
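
As a rough illustration of the range-set idea, the following Python sketch checks the part-of-speech field of backslash-coded lexicon records against a closed set of category labels, and proposes a migrated label for legacy values. It is not the template described above: the category labels, the legacy mapping, the sample records and the function names are all assumptions made for the example, and only the MDF-style markers \lx, \ps and \ge are taken from common Shoebox/Toolbox practice.

```python
# Hypothetical range set: a small, illustrative subset of category labels.
# These are placeholders, not an authoritative list of GOLD concepts.
RANGE_SET = {"Noun", "Verb", "Adjective"}

# Hypothetical mapping from legacy, project-specific labels to range-set labels.
LEGACY_TO_RANGE = {"n": "Noun", "v": "Verb", "adj": "Adjective"}


def parse_records(text, record_marker="lx"):
    """Split backslash-coded text into records, each a list of (marker, value)."""
    records = []
    for line in text.splitlines():
        if not line.startswith("\\"):
            continue  # this sketch ignores blank and continuation lines
        marker, _, value = line[1:].partition(" ")
        # A record marker (or the very first field) starts a new record.
        if marker == record_marker or not records:
            records.append([])
        records[-1].append((marker, value.strip()))
    return records


def check_field(records, field="ps"):
    """Return (headword, original, migrated) for every occurrence of `field`.

    `migrated` is None when the label is neither in the range set nor mappable.
    """
    report = []
    for record in records:
        headword = record[0][1]  # value of the first field in the record
        for marker, value in record:
            if marker != field:
                continue
            migrated = value if value in RANGE_SET else LEGACY_TO_RANGE.get(value)
            report.append((headword, value, migrated))
    return report


# Placeholder records; the headwords and glosses are not real language data.
SAMPLE = (
    "\\lx word1\n\\ps n\n\\ge gloss1\n\n"
    "\\lx word2\n\\ps Verb\n\n"
    "\\lx word3\n\\ps xyz\n"
)

if __name__ == "__main__":
    for headword, original, migrated in check_field(parse_records(SAMPLE)):
        status = migrated if migrated is not None else "NOT IN RANGE SET"
        print(f"{headword}: {original} -> {status}")
```

In this sketch a value already in the range set passes through unchanged, a known legacy label is mapped forward, and anything else is flagged for manual review, which mirrors in miniature the forward-migration question raised above.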


References
[1] General Ontology for Linguistic Description
http://www.linguistics-ontology.org/

[2] Shoebox
http://www.sil.org/computing/shoebox

[3] Toolbox
http://www.sil.org/computing/toolbox

[4] Cathy Bow, Baden Hughes and Steven Bird, 2003.
A Four Level Model of Interlinear Text. Proceedings of the EMELD Digitization Project Workshop on Digitizing and Annotating Texts and Field Recordings, 11-13 July 2003, Michigan State University.
http://emeld.org/workshop/2003/bowbadenbird-paper.pdf

[5] John Bell and Steven Bird, 2000.
A Preliminary Study of the Structure of Lexicon Entries. Proceedings of the Workshop on Web-Based Language Documentation and Description, 12-15 December 2000, Philadelphia, USA.
http://www.ldc.upenn.edu/exploration/expl2000/papers/bell/bell.html

[6] Virtual Kayardild Repository
http://www.cs.mu.oz.au/research/lt/projects/kayardild

[7] Aboriginal Studies Electronic Data Archive (ASEDA)
http://coombs.anu.edu.au/SpecialProj/ASEDA/

[8] Dafydd Gibbon, Catherine Bow, Steven Bird and Baden Hughes, 2004.
Securing Interpretability: The Case of Ega Language Documentation. Proceedings of the 4th International Conference on Language Resources and Evaluation. European Language Resources Association: Paris. pp 1369-1372.

Author Bio(s):
Baden Hughes is a Research Fellow in the Department of Computer Science and Software Engineering at the University of Melbourne.

David Nash is a Visiting Fellow at the Australian Institute of Aboriginal and Torres Strait Islander Studies (AIATSIS).