The Electronic Encoding of Lexical Resources:
A Roadmap to Best Practice

Gary F. Simons, SIL International
Revised draft of 7 August 2002

EMELD Workshop on Digitizing Lexical Information
2-5 August 2002, Ypsilanti, MI


Introduction

The EMELD proposal (section 3.1) lists the following among the objectives of the project:

The purpose of the workshop is to begin the process of developing best practice recommendations for the markup and metadata description of lexical resources. The purpose of this document is to lay out a roadmap for what needs to be developed in order to meet the objectives of the project. This is done by proposing requirements for the eventual solution and then enumerating consequent features of its implementation. But first I begin with some background definitions.

Background definitions

Before developing a system for encoding lexical resources, it is necessary to define the audience for that system. This was done, for instance, by Ide and others (1992) for the most widely known system of dictionary encoding, namely the TEI guidelines (Sperberg-McQueen and Burnard 2001b). The TEI guidelines focus on dictionaries that have already been published in print. Their developers counted within the audience not only the lexicographer who creates dictionaries and the computational linguist who mines information from encoded dictionaries, but also the print historian who wants to study conventions of typesetting and layout. They identify three views of the dictionary and conclude that the markup must be able to encode all three views and the mappings among them. The three views are the typographic view, the textual view, and the lexical view.

In the EMELD project we can (and should) narrow the focus. Our aim is to give guidance to field linguists on how they should create electronically encoded lexical resources so as to maximize their long-term usefulness. In this context only the lexical view is of relevance. The typographic and textual views are not part of the information resource itself, but will be added by automated processes (using stylesheets) that tailor the published appearance to the needs of a given target audience.

To create electronically encoded lexical resources we will use markup languages. A markup language, like a natural language, has a lexicon, syntax, and semantics. The following terms are used throughout this paper to refer to the descriptive artifacts that document these three aspects of markup:

markup vocabulary

Enumerates the lexical inventory of markup: i.e., the set of elements and attributes that are used in marking up a resource. (In practice, the vocabulary is enumerated within the markup schema rather than in a separate document.)

markup schema

Specifies the syntax of markup: i.e., a formal grammar defining constraints on where elements and attributes must or may occur with respect to embedding and relative order. (This is typically realized in an XML DTD or an XML Schema, though other mechanisms are emerging.)

markup metaschema

Specifies the semantics of markup: i.e., a formal mapping from elements and attributes to the linguistic concepts they represent. (This area of markup is not as well developed as the syntactic area, but is beginning to be developed under the impetus of the so-called Semantic Web (W3C 2002).)
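
To make the first of these terms concrete, the following minimal sketch enumerates the markup vocabulary actually used in a small XML lexicon. The sample entries, element names, and attribute names are invented for illustration (they come from no real language or recommended schema), and the function name markup_vocabulary is likewise an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-entry lexicon. The forms, element names, and
# attribute names are invented for illustration only.
SAMPLE = """<lexicon>
  <entry id="e1">
    <headword>kita</headword>
    <pos>noun</pos>
    <sense n="1"><gloss lang="en">tree</gloss></sense>
  </entry>
  <entry id="e2">
    <headword>mori</headword>
    <pos>verb</pos>
    <sense n="1"><gloss lang="en">to walk</gloss></sense>
  </entry>
</lexicon>"""


def markup_vocabulary(xml_text):
    """Return the sorted element names and attribute names used in a document."""
    root = ET.fromstring(xml_text)
    elements, attributes = set(), set()
    for node in root.iter():            # visits the root and every descendant
        elements.add(node.tag)
        attributes.update(node.attrib)  # adds the attribute names (dict keys)
    return sorted(elements), sorted(attributes)


elements, attributes = markup_vocabulary(SAMPLE)
print(elements)    # ['entry', 'gloss', 'headword', 'lexicon', 'pos', 'sense']
print(attributes)  # ['id', 'lang', 'n']
```

Such a listing only inventories the vocabulary. It says nothing about where the elements may occur (the schema) or what they mean (the metaschema).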

Requirements

In this presentation of requirements the individual requirements are set apart as numbered statements in order to facilitate discussion. Similarly, the consequent features are set out as subordinate statements that bear an identification letter, as in:

  1. A requirement from the linguist's point of view
    a. A consequent feature of the implementation
    b. Another feature of the implementation

The first requirement deals with the need for longevity of access far into the future. This aspect of language documentation and description is covered in detail in Bird and Simons (2002); only a few key points are noted here:

  1. Lexical resources, especially those describing endangered languages, need to be accessible by any interested party long into the future.
    a. The archival form of electronically encoded resources should not be in a proprietary, binary format, since the need for proprietary software limits the audience who can access the resources and such formats are likely to become obsolete and inaccessible within a few years.
    b. The archival form of electronically encoded resources should not be an interactive Web application, since upgrades to hardware and system software typically cause these to cease to function within a few years.
    c. The archival form of electronically encoded textual resources should be based on clear text formats that can be read with any text editor and many other tools.
    d. The complexity of lexical resources is such that XML is the format of choice for meeting these requirements.

Microsoft Word documents provide an example of a proprietary, binary format that is not acceptable for long-term preservation of information. Plain text documents formatted with line breaks and spaces are an example of a format that satisfies features a through c of requirement 1; so are tab- or comma-delimited representations of spreadsheets or data tables. But most lexical resources have a more complex structure involving hierarchy and cross-reference, so a more sophisticated representation is needed. Markup based on the XML standard meets all the above requirements and is now supported by such a wide variety of tools (both open and proprietary) that it has become the clear choice for archival formats. Those unfamiliar with XML are referred to the Text Encoding Initiative's "Gentle Introduction to XML" (Sperberg-McQueen and Burnard 2001a). But what should the nature of the markup vocabulary be?

  2. Linguists need to be able to do more than just read lexicons in display format; they also need to be able to manipulate the content by selectively accessing individual items of information.
    a. The archival form of electronically encoded resources should not follow a strategy of presentational markup; that is, the markup vocabulary should not be one that simply identifies what the information will look like when displayed.
    b. The archival form of electronically encoded lexical resources should follow a strategy of descriptive markup; that is, the markup vocabulary should identify what the individual pieces of information are from a linguistic point of view.
    c. The markup vocabulary for a particular lexical resource should identify all of the elements of information that go into the description of a lexical item, not just some of them.
    d. Users still need a presentational display of the resource; this should be accomplished by applying a stylesheet to the descriptively marked up resource.
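
The last of these features, deriving a presentational display from descriptive markup, can be sketched as follows. This is a stand-in written in Python rather than an actual stylesheet language such as XSLT; the entry, its element names, and the choice of bold and italic are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical descriptively marked-up entry (invented form and element names).
ENTRY_XML = """<entry>
  <headword>kita</headword>
  <pos>noun</pos>
  <gloss>tree</gloss>
</entry>"""


def render_entry(entry):
    """Play the role of a stylesheet: map descriptive elements to display HTML."""
    headword = escape(entry.findtext("headword", default=""))
    pos = escape(entry.findtext("pos", default=""))
    gloss = escape(entry.findtext("gloss", default=""))
    # Presentation decisions live here, not in the archived resource.
    return f"<p><b>{headword}</b> <i>{pos}</i> {gloss}</p>"


html_line = render_entry(ET.fromstring(ENTRY_XML))
print(html_line)  # <p><b>kita</b> <i>noun</i> tree</p>
```

The mapping runs in one direction only. The display can be regenerated from the descriptive source at any time, whereas the bold and italic of the output do not by themselves record which piece is the headword and which is the part of speech.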

HTML markup, when applied to lexical resources, is an example of presentational markup. Though it does have the features of longevity needed for an archival format, it does not offer linguists the ability to do automated processing of a linguistic nature, such as to answer the query "What are the part-of-speech categories used in this lexicon?" For this purpose a markup vocabulary that specifically identifies the linguistic significance of each piece of information is needed. But simply having a markup vocabulary is not enough; for each lexical resource there is also a grammar that defines how the individual markup elements combine to form valid lexical descriptions.
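
With descriptive markup, the part-of-speech query quoted above reduces to a few lines. The sketch below assumes a hypothetical lexicon in which the part of speech is held in an element named pos; the sample forms and the function name are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical lexicon with descriptive markup (invented forms and names).
LEXICON = """<lexicon>
  <entry><headword>kita</headword><pos>noun</pos></entry>
  <entry><headword>mori</headword><pos>verb</pos></entry>
  <entry><headword>sulu</headword><pos>noun</pos></entry>
</lexicon>"""


def part_of_speech_categories(xml_text):
    """Return the distinct part-of-speech labels used in the lexicon, sorted."""
    root = ET.fromstring(xml_text)
    return sorted({pos.text.strip() for pos in root.iter("pos") if pos.text})


pos_categories = part_of_speech_categories(LEXICON)
print(pos_categories)  # ['noun', 'verb']
```

Had the same entries been archived as HTML with the part of speech marked only as italic text, a comparable query would have to guess which italic spans are part-of-speech labels and which are something else.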

  3. The linguist creating a lexical resource needs for the markup of the resource to be consistent with his or her plan for its content and structure.
    a. Best practice requires the use of a markup schema (such as an XML DTD or an XML Schema) to validate a given lexicon as conforming to its plan for the content and structure.
    b. A single markup schema that sanctions all common practices in structuring the content of lexical resources will be too permissive to constrain any single resource to the specific plan of its creator. (See Simons 1998 for a discussion of this point with respect to the TEI DTD for print dictionaries.)
    c. There is enough convergence of practice that it will be possible to develop one or more specific markup schemas that can be recommended for widespread use while being adequately constraining.
    d. There will always be plans for content and structure that are unique enough to require that a unique markup schema be devised for the resource. (This is one of the conclusions of the 2001 EMELD workshop.)
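
The role a schema plays in validation can be illustrated with a deliberately simplified sketch. It is not a DTD or XML Schema validator: the "plan" is a pair of hand-written Python tables, the check covers only nesting and required children (not relative order or attributes), and all element names are hypothetical. In practice one would validate against a real DTD or XML Schema with a validating parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical plan for a lexicon: which children each element may contain,
# and which children an element must contain at least once.
ALLOWED = {
    "lexicon": {"entry"},
    "entry": {"headword", "pos", "gloss"},
    "headword": set(),
    "pos": set(),
    "gloss": set(),
}
REQUIRED = {"entry": ["headword", "pos"]}


def validate(xml_text):
    """Return a list of violations of the plan (empty if the document conforms)."""
    errors = []
    for node in ET.fromstring(xml_text).iter():
        if node.tag not in ALLOWED:
            errors.append(f"unknown element <{node.tag}>")
            continue
        child_tags = [child.tag for child in node]
        for tag in child_tags:
            # Unknown children are reported once, when they are visited themselves.
            if tag in ALLOWED and tag not in ALLOWED[node.tag]:
                errors.append(f"<{tag}> not allowed inside <{node.tag}>")
        for tag in REQUIRED.get(node.tag, []):
            if tag not in child_tags:
                errors.append(f"<{node.tag}> is missing required <{tag}>")
    return errors


GOOD = "<lexicon><entry><headword>kita</headword><pos>noun</pos><gloss>tree</gloss></entry></lexicon>"
BAD = "<lexicon><entry><headword>mori</headword><etymology>?</etymology></entry></lexicon>"

print(validate(GOOD))  # []
print(validate(BAD))
```

A plan this specific rejects the second document, whereas a schema permissive enough to sanction every common practice would have to accept it. That is the trade-off described in feature b above.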

These consequences of requirement 3 thus mean that there will be multiple markup schemas, even in the context of best practice. In order to achieve interoperability of resources when there are multiple markup schemas we will need to introduce a meta-level in our approach to markup:

  4. Linguists need to be able to query and otherwise manipulate multiple lexicons in a single operation, even though they may individually have different markup vocabularies and schemas.
    a. As a foundation for interoperability, there must be a shared ontology (Langendoen and others 2002) for the kinds of information that are marked up in lexical resources.
    b. As the bridge to interoperability, each resource must have a metaschema that formally documents how the elements and attributes of its markup schema map onto the concepts of the common ontology.
    c. The metaschema must be separate from the lexical resource (rather than being an integral part of it) so that multiple resources can share the same metaschema.
    d. It must be possible for a third party to create a metaschema for a resource that lacks one without changing the resource itself. (This implies that the linkage from metaschema to schema to resource is specified through metadata.)
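
The way a metaschema bridges differing vocabularies can be sketched as follows. Each metaschema is represented here as a plain Python dictionary from element names to concept names; the concept names (LexicalEntry, Lemma, PartOfSpeech), both sample lexicons, and the function names are assumptions of this sketch and are not taken from the EMELD ontology or from any defined metaschema format.

```python
import xml.etree.ElementTree as ET

# Two hypothetical lexicons that use different markup vocabularies.
LEXICON_A = """<lexicon>
  <entry><headword>kita</headword><pos>noun</pos></entry>
  <entry><headword>mori</headword><pos>verb</pos></entry>
</lexicon>"""

LEXICON_B = """<dictionary>
  <lexeme><form>palu</form><category>noun</category></lexeme>
  <lexeme><form>teki</form><category>adjective</category></lexeme>
</dictionary>"""

# Hypothetical metaschemas, kept outside the resources themselves:
# element name in the resource's schema -> concept in the shared ontology.
METASCHEMA_A = {"entry": "LexicalEntry", "headword": "Lemma", "pos": "PartOfSpeech"}
METASCHEMA_B = {"lexeme": "LexicalEntry", "form": "Lemma", "category": "PartOfSpeech"}


def values_for_concept(xml_text, metaschema, concept):
    """Collect the text of every element that the metaschema maps to `concept`."""
    tags = {tag for tag, mapped in metaschema.items() if mapped == concept}
    root = ET.fromstring(xml_text)
    return [n.text.strip() for n in root.iter() if n.tag in tags and n.text]


def query_all(resources, concept):
    """Run one query over several (lexicon, metaschema) pairs."""
    found = set()
    for xml_text, metaschema in resources:
        found.update(values_for_concept(xml_text, metaschema, concept))
    return sorted(found)


resources = [(LEXICON_A, METASCHEMA_A), (LEXICON_B, METASCHEMA_B)]
all_pos = query_all(resources, "PartOfSpeech")
print(all_pos)  # ['adjective', 'noun', 'verb']
```

Neither lexicon is modified by the query, and either dictionary could have been written by a third party after the fact, which is the point of features c and d above.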

Finally, it is not enough that electronically encoded resources are created. They must also be found and used by others long into the future. This implies a final set of consequences having to do with archiving.

  5. Linguists, educators, speakers of the language, and any other interested citizens of the world need to be able to find and use electronically encoded lexical resources long into the future.
    a. Completed lexical resources (with associated schemas and metaschemas) must be deposited into archives that can guarantee their long-term preservation and access.
    b. In order to make it possible for potential users of the resource to discover that the resource exists, a metadata description of it needs to be written and published in a searchable catalog of worldwide language resources.
    c. In order to help potential users of the resource judge the relevance of the resource, the metadata description needs to include information like identification of the specific language and a characterization of what type of resource it is from a linguistic point of view.

The Open Language Archives Community is already in place with an infrastructure that meets these needs, and EMELD will build on this infrastructure.

The shape of best practice

Taken together, the above requirements and the consequent features of implementation suggest the following shape for best practice:

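+---------------------+----------------------------------+----------------------------------------------+
|                     | Best Practice for Resource       | What the Community Must Do to Support Best   |
|                     | Creation                         | Practice                                     |
+---------------------+----------------------------------+----------------------------------------------+
| Lexical description | Archive resource as an XML       | 1. Document characteristics of best practice |
|                     | document that is valid with      |    descriptive markup.                       |
|                     | respect to a descriptive markup  | 2. Recommend one or more markup schemas that |
|                     | schema that is supplied with the |    meet these characteristics.               |
|                     | resource.                        | 3. Develop stylesheets for these schemas.    |
+---------------------+----------------------------------+----------------------------------------------+
| Metadescription for | Provide OLAC metadata for the    | 4. Define the OLAC metadata standard.        |
| resource discovery  | resource and deposit it with an  | 5. Define the controlled vocabulary for      |
|                     | OLAC data provider.              |    identifying lexical resource types in     |
|                     |                                  |    <type.linguistic>.                        |
|                     |                                  | 6. Develop a community service for resource  |
|                     |                                  |    discovery.                                |
+---------------------+----------------------------------+----------------------------------------------+
| Metadescription for | Provide a metaschema for the     | 7. Define a common ontology of the concepts  |
| resource            | resource.                        |    of lexical description.                   |
| interoperation      |                                  | 8. Define the standard markup schema for a   |
|                     |                                  |    metaschema.                               |
|                     |                                  | 9. Develop metaschemas for the schemas       |
|                     |                                  |    recommended in point 2 above.             |
|                     |                                  | 10. Develop a community service that uses    |
|                     |                                  |     metaschemas to provide interoperation    |
|                     |                                  |     across multiple lexical resources.       |
+---------------------+----------------------------------+----------------------------------------------+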

When the 10 community action steps listed in the last column have been completed, the "formulation" part of the EMELD objectives listed at the outset of this paper will have been met for the area of lexicons. The "promulgation" part will require additional work in areas like documentation, dissemination, and training.

References

Bird, Steven and Gary Simons, 2002. Seven Dimensions of Portability for Language Documentation and Description, Proceedings of the Workshop on Portability Issues in Human Language Technologies, Third International Conference on Language Resources and Evaluation, Las Palmas, Canary Islands. Available at: http://arxiv.org/abs/cs/0204020

Ide, Nancy and others, 1992. Principles for encoding machine readable dictionaries, EURALEX'92 Proceedings. Available at: http://www.cs.vassar.edu/~ide/papers/Euralex92.ps

Langendoen, D. Terence and others, 2002. Publications of the EMELD Arizona group. Available at: http://emeld.douglass.arizona.edu:8080/group.html

Simons, Gary F., 1998. Using architectural processing to derive small, problem-specific XML applications from large, widely-used SGML applications, SIL Electronic Working Papers 1998-006. Available at: http://www.sil.org/silewp/1998/006/

Sperberg-McQueen, C.M. and Lou Burnard, 2001a. A Gentle Introduction to XML. Chapter 2 of TEI P4: Guidelines for Electronic Text Encoding and Interchange, XML-compatible edition. TEI Consortium. Available at: http://www.tei-c.org/P4X/SG.html

Sperberg-McQueen, C.M. and Lou Burnard, 2001b. Print Dictionaries. Chapter 12 of TEI P4: Guidelines for Electronic Text Encoding and Interchange, XML-compatible edition. TEI Consortium. Available at: http://www.tei-c.org/P4X/DI.html

W3C, 2002. The Semantic Web, an activity of the World Wide Web Consortium. Home page: http://www.w3.org/2001/sw/


Assignments for workshop workgroups

This analysis of the shape of best practice makes it possible to offer more focus to the assignments for the three workgroups that will function during the workshop:

Workgroup Tasks
Group I: Principles of Lexical Description
  • Bring feedback concerning anything to change in this roadmap document (especially as it regards principles of lexical description).
  • Begin to document characteristics of best practice descriptive markup (action 1 above).
  • Use the samples of lexical markup brought by group members to explore the problem of metadescription of lexical markup. Develop some possible metaschemas to describe the markup in these examples (a step toward action 9 above).
Group II: Markup of Lexical Entries (emphasis on ontological concepts)
  • Bring feedback concerning anything to change in this roadmap document (especially as it regards ontology issues).
  • Go through the existing markup proposals and lists of elements that need to be accounted for in markup to place them in the existing EMELD ontology of linguistic concepts (action 7 above).
  • Use the lexicon samples brought by group members as a means of checking the coverage of the ontology. Identify every element of content in the lexical entries and check that each is accounted for in the ontology.
Group III: Lexicon Macrostructure
  • Bring feedback concerning anything to change in this roadmap document (especially as it regards macrostructure issues).
  • Identify concepts of lexical macrostructure that need to be included in the ontology of lexical description (a contribution to action 7 above).
  • Review the lexicon/* terms in the OLAC Linguistic Data Type vocabulary and suggest improvements (including deletions, additions, changes). In particular, tighten up definitions of the types in terms of the elements of microstructure (e.g., concepts from the ontology) that characterize them (action 5 above).