Joseph E. Grimes, SIL International, University of Hawaii at Manoa

Designing Tools That Promote Archiving

Tools for linguistics can be designed to promote consistent archiving. This includes structures for metadata on the persons who collect and interpret the data, the works they produce, and the speech varieties they include. Three areas of linguistics highlight diverse metadata patterns:

Comparative Linguistics: Wordcorr, with 447 downloads, helps linguists apply the comparative method to parallel word lists. It requests metadata for each user: an ID, name, email address, and institution. Each data collection has a title prefixed with the creator's ID, collaborators, language of description, publication and copyright information, and information on accessibility. Each speech variety being compared appears as a subject language; it contains metadata on published and unpublished sources, language names and affiliations, Ethnologue code, and other things relevant to comparativists but not yet in the OLAC schema. Wordcorr transforms its metadata into OLAC form for incorporation into a repository.
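The metadata layers described above (user, collection, subject language) and their transformation into an OLAC-style record can be sketched roughly as follows. This is an illustrative sketch only, not Wordcorr's actual data model or output: the class names, field names, the `": "` title separator, and the sample values (`jdoe`, `xxx`) are all assumptions, and only a few Dublin Core elements are emitted.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List

DC = "http://purl.org/dc/elements/1.1/"
OLAC = "http://www.language-archives.org/OLAC/1.1/"
XSI = "http://www.w3.org/2001/XMLSchema-instance"


@dataclass
class User:
    user_id: str
    name: str
    email: str
    institution: str


@dataclass
class Variety:
    name: str
    ethnologue_code: str


@dataclass
class Collection:
    creator: User
    short_title: str
    collaborators: List[str] = field(default_factory=list)
    varieties: List[Variety] = field(default_factory=list)

    @property
    def title(self) -> str:
        # Titles are prefixed with the creator's ID; the separator is assumed.
        return f"{self.creator.user_id}: {self.short_title}"


def to_olac_like_xml(c: Collection) -> str:
    """Serialize a collection as a simplified OLAC-style record."""
    ET.register_namespace("dc", DC)
    ET.register_namespace("olac", OLAC)
    ET.register_namespace("xsi", XSI)
    root = ET.Element(f"{{{OLAC}}}olac")
    ET.SubElement(root, f"{{{DC}}}title").text = c.title
    ET.SubElement(root, f"{{{DC}}}creator").text = c.creator.name
    for person in c.collaborators:
        ET.SubElement(root, f"{{{DC}}}contributor").text = person
    for v in c.varieties:
        # Each compared speech variety becomes a subject language.
        subj = ET.SubElement(root, f"{{{DC}}}subject")
        subj.set(f"{{{XSI}}}type", "olac:language")
        subj.set(f"{{{OLAC}}}code", v.ethnologue_code)
        subj.text = v.name
    return ET.tostring(root, encoding="unicode")


# Placeholder data, not taken from any real collection.
sample = Collection(
    creator=User("jdoe", "J. Doe", "jdoe@example.org", "Example University"),
    short_title="Sample word lists",
    collaborators=["A. Colleague"],
    varieties=[Variety("Sample Variety", "xxx")],
)
print(to_olac_like_xml(sample))
```

Metadata that comparativists need but that has no place in the OLAC schema would have to be kept in the tool's own records; this sketch simply omits it.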

Sociolinguistics: Multilingual situations cannot be assessed without information on how proficient the various segments of a community are. One test is based on the observation that you have to know a language well in order to repeat whole sentences in it. The sentence repetition test discriminates well among lower levels of proficiency but poorly among higher ones.

The test is simple, but setting it up correctly is not. One computational tool from the 1980s has the algorithms right but is not at all user-friendly. A redesign is therefore needed that emphasizes the user interfaces and the metadata for
  • the test designer for a particular L2,
  • native readers and testers for the candidate sentences,
  • L2 speakers at various proficiency levels established by other means for calibration purposes,
  • test protocols for every calibration test,
  • calibration results of every calibration sentence,
  • equivalent sets of test sentences,
  • test administrators trained to score the test, each using a separate PDA,
  • the sample of L2 speakers being tested,
  • test protocols for every field test, collated from multiple PDAs,
  • several kinds of summary results.
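As one illustration of the items above, collating field-test protocols from several PDAs might look like the following sketch. The record layout, the field names, and the integer per-sentence score are assumptions made for the example; they are not taken from the 1980s tool or from any existing redesign.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class ScoreRecord:
    pda_id: str         # device the administrator used (hypothetical field)
    administrator: str  # who scored the sentence
    speaker_id: str     # L2 speaker being tested
    sentence_id: str    # test sentence that was repeated
    points: int         # assumed per-sentence score


def collate(batches: Iterable[List[ScoreRecord]]) -> Dict[str, int]:
    """Merge the records from every PDA into one total per speaker."""
    totals: Dict[str, int] = defaultdict(int)
    seen = set()
    for batch in batches:
        for rec in batch:
            key = (rec.speaker_id, rec.sentence_id)
            if key in seen:
                # The same sentence scored twice for one speaker is an error.
                raise ValueError(f"duplicate score for {key}")
            seen.add(key)
            totals[rec.speaker_id] += rec.points
    return dict(totals)


# Placeholder data from two hypothetical devices.
pda_a = [
    ScoreRecord("pda-a", "admin1", "spk1", "s1", 3),
    ScoreRecord("pda-a", "admin1", "spk1", "s2", 2),
]
pda_b = [ScoreRecord("pda-b", "admin2", "spk2", "s1", 1)]
print(collate([pda_a, pda_b]))
```

Keeping the device and administrator on every record means the collated protocol still says who scored what, which is the kind of metadata the list above calls for.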

Lexicography: A Web-based tool for producing dictionaries of endangered or underdocumented languages, now in the early design stage, will probably use a factory design pattern to accommodate diverse structures: for example, alphabetic versus semantic arrangement of entries, internal structuring of entries by sense or by part of speech, and different uses of subentries.
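A factory of this kind might be sketched as follows for the choice between alphabetic and semantic arrangement. Since the tool is still being designed, every name here (`Entry`, `Arrangement`, `make_arrangement`, the `semantic_domain` field) is hypothetical, and the sketch covers only one of the structural choices mentioned.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Entry:
    headword: str
    semantic_domain: str  # assumed field used for semantic arrangement


class Arrangement:
    """Common interface for the ways a dictionary can order its entries."""

    def arrange(self, entries: List[Entry]) -> List[Entry]:
        raise NotImplementedError


class AlphabeticArrangement(Arrangement):
    def arrange(self, entries: List[Entry]) -> List[Entry]:
        return sorted(entries, key=lambda e: e.headword.casefold())


class SemanticArrangement(Arrangement):
    def arrange(self, entries: List[Entry]) -> List[Entry]:
        # Group by domain first, then order alphabetically within a domain.
        return sorted(
            entries,
            key=lambda e: (e.semantic_domain.casefold(), e.headword.casefold()),
        )


_ARRANGEMENTS: Dict[str, Callable[[], Arrangement]] = {
    "alphabetic": AlphabeticArrangement,
    "semantic": SemanticArrangement,
}


def make_arrangement(kind: str) -> Arrangement:
    """Factory: callers name the arrangement and never touch the classes."""
    try:
        return _ARRANGEMENTS[kind]()
    except KeyError:
        raise ValueError(f"unknown arrangement: {kind!r}") from None


entries = [
    Entry("arm", "body"),
    Entry("bird", "animal"),
    Entry("cat", "animal"),
]
for kind in ("alphabetic", "semantic"):
    ordered = make_arrangement(kind).arrange(entries)
    print(kind, [e.headword for e in ordered])
```

A new arrangement would be added by writing one class and registering it in the table, leaving the calling code unchanged.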

Different granularity is needed for different presentations:
  • the same example, or fragments of it, may illustrate a number of entries
  • its source may or may not be included in a given presentation
  • multivalent lexical function values, linked in both directions, may be presented to different levels of detail
  • metadata for the different presentation options need to be incorporated in addition to the usual metadata for creator, collaborators, library search, subject language, language of description (if different), and possibly the protolanguage if reconstructions or known etyma are included
  • if the dictionary covers more than one speech variety, the same metadata needed for comparative linguistics are needed as well.
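The first two points above, one example shared by several entries and a source that may or may not be shown, could be handled by storing each example once and referring to it by ID. The following is a minimal sketch under that assumption; the class and field names and the sample data are hypothetical, and fragments of examples are not modeled.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Example:
    text: str
    source: str


@dataclass
class DictEntry:
    headword: str
    example_ids: List[str]  # references into a shared example store


def render(entry: DictEntry, examples: Dict[str, Example],
           include_source: bool) -> str:
    """Render an entry; the presentation decides whether sources appear."""
    lines = [entry.headword]
    for ex_id in entry.example_ids:
        ex = examples[ex_id]
        line = f"  {ex.text}"
        if include_source:
            line += f" [{ex.source}]"
        lines.append(line)
    return "\n".join(lines)


# Placeholder data: one example illustrating two entries.
examples = {"ex1": Example("The dog sees the bird.", "field notes")}
dog = DictEntry("dog", ["ex1"])
bird = DictEntry("bird", ["ex1"])
print(render(dog, examples, include_source=True))
print(render(bird, examples, include_source=False))
```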