The Electronic Encoding of Text Resources:
A Roadmap to Best Practice
Gary F. Simons
SIL International
EMELD Workshop on Digitizing and Annotating Texts and Field Recordings
11-13 July 2003, E. Lansing, MI

Introduction  
The EMELD proposal (section 3.1) lists the following among the objectives of the project:

  • Formulation and promulgation of best practice in:
    • Linguistic markup of texts and lexicon
    • The creation of metadata for language resources
One of the goals of this workshop is to begin the process of developing best practice recommendations for the markup and metadata description of annotated texts and field recordings. The purpose of this document is to lay out a roadmap for what needs to be developed in order to meet these objectives of the project. This is done by proposing requirements for the eventual solution and then enumerating consequent features of its implementation. But first I begin with some background definitions.

 
Definitions  
To create electronically encoded resources we will use markup languages. One conclusion of the first EMELD workshop in 2001 was that EMELD would recommend markup based on XML, the Extensible Markup Language—an information interchange standard of the World Wide Web Consortium (W3C 2000). Those unfamiliar with XML are referred to the Text Encoding Initiative's "Gentle Introduction to XML" (Sperberg-McQueen and Burnard 2001).

A markup language, like a natural language, has a lexicon, syntax, and semantics. The following terms are used throughout this paper to refer to the descriptive artifacts that document these three aspects of markup:

markup vocabulary
Enumerates the lexical inventory of markup: i.e., the set of elements and attributes that are used in marking up a resource. (In practice, the vocabulary is enumerated within the markup schema rather than in a separate document.)


markup schema
Specifies the syntax of markup: i.e., a formal grammar defining constraints on where elements and attributes must or may occur with respect to embedding and relative order. (This is typically realized in an XML DTD or an XML Schema, though other mechanisms are emerging.)


markup metaschema
Specifies the semantics of markup: i.e., a formal mapping from elements and attributes to the linguistic concepts they represent. (This area of markup is not as well developed as the syntactic area, but is beginning to be developed under the impetus of the so-called Semantic Web; see W3C 2002.)
 
Requirements  
In this presentation of requirements the individual requirements are set apart as numbered statements in order to facilitate discussion. Similarly, the consequent features are set out as subordinate statements that bear an identification letter, as in:

  1. A requirement from the linguist's point of view
    a. A consequent feature of the implementation
    b. Another feature of the implementation
The first requirement deals with the need for longevity of access far into the future. This aspect of language documentation and description is covered in detail in Bird and Simons (2002); only a few key points are noted here:

  1. Language resources, especially those describing endangered languages, need to be accessible by any interested party long into the future.
    a. The archival form of electronically encoded resources should not be in a proprietary, binary format, since the need for proprietary software limits the audience who can access the resources and such formats are likely to become obsolete and inaccessible within a few years.
    b. The archival form of electronically encoded resources should not be an interactive Web application, since upgrades to hardware and system software typically cause these to cease to function within a few years.
    c. The archival form of electronically encoded textual resources should be based on clear text formats that can be read with any text editor and many other tools.
    d. The complexity of annotated texts is such that XML is the format of choice for meeting these requirements.
Microsoft Word documents provide an example of a proprietary, binary format that is not acceptable for long-term preservation of information. Plain text documents formatted with line breaks and spaces are an example of a format that satisfies features a through c; so are tab- or comma-delimited representations of spreadsheets or data tables. But most language resources have a more complex structure involving hierarchy and cross-reference, so a more sophisticated representation is needed. Markup based on the XML standard meets all the above requirements and is now supported by such a wide variety of tools (both open and proprietary) that it has become the clear choice for archival formats. But what should the nature of the markup vocabulary be?

  2. Linguists need to be able to do more than just read language resources in display format; they also need to be able to manipulate the content by selectively accessing individual items of information.
    a. The archival form of electronically encoded resources should not follow a strategy of presentational markup; that is, the markup vocabulary should not be one that simply identifies what the information will look like when displayed.
    b. The archival form of electronically encoded language resources should follow a strategy of descriptive markup; that is, the markup vocabulary should identify what the individual pieces of information are from a linguistic point of view.
    c. The markup vocabulary for a particular language resource should identify all of the elements of information that are contained within it, not just some of them.
    d. Users still need a presentational display of the resource; this should be accomplished by applying a stylesheet to the descriptively marked up resource.
HTML markup, when applied to language resources, is an example of presentational markup. Though it does have the features of longevity needed for an archival format, it does not offer linguists the ability to do automated processing of a linguistic nature, such as to answer the query "What are the part-of-speech categories used in tagging this text?" For this purpose a markup vocabulary that specifically identifies the linguistic significance of each piece of information is needed. But simply having a markup vocabulary is not enough; each marked-up resource also has a grammar that defines how the individual markup elements may combine to form a valid resource.
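
To make the contrast concrete, the following Python sketch answers the part-of-speech query against a small descriptively marked-up text and then derives a presentational rendering from the same source. The element and attribute names (text, sentence, word, form, gloss, pos) and the word forms are invented for illustration; they are not a vocabulary recommended by EMELD, and the rendering function only stands in for a real stylesheet.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical descriptive markup. The vocabulary and the word forms are
# invented for this sketch and do not come from any real resource.
SAMPLE = """
<text>
  <sentence>
    <word pos="N"><form>kuri</form><gloss>dog</gloss></word>
    <word pos="V"><form>mata</form><gloss>see</gloss></word>
  </sentence>
  <sentence>
    <word pos="PRO"><form>na</form><gloss>3SG</gloss></word>
    <word pos="V"><form>ruru</form><gloss>sleep</gloss></word>
  </sentence>
</text>
"""


def pos_categories(xml_source: str) -> list[str]:
    """Return the sorted set of part-of-speech tags used in the text."""
    root = ET.fromstring(xml_source.strip())
    return sorted({w.get("pos") for w in root.iter("word") if w.get("pos")})


def render_html(xml_source: str) -> str:
    """Produce a simple display form: one paragraph per sentence,
    with the word forms in italics above the glosses."""
    root = ET.fromstring(xml_source.strip())
    paragraphs = []
    for sentence in root.iter("sentence"):
        words = list(sentence.iter("word"))
        forms = " ".join(w.findtext("form", default="") for w in words)
        glosses = " ".join(w.findtext("gloss", default="") for w in words)
        paragraphs.append(
            f"<p><i>{html.escape(forms)}</i><br/>{html.escape(glosses)}</p>"
        )
    return "\n".join(paragraphs)


if __name__ == "__main__":
    print(pos_categories(SAMPLE))
    print(render_html(SAMPLE))
```

The query is a one-line traversal because the markup says what each item is. The HTML produced by render_html no longer records which string is a form, which is a gloss, or what the part of speech was, so the same query could not be answered from the display form alone.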

  3. The linguist creating a language resource needs the markup of the resource to be consistent with his or her plan for its content and structure.
    a. Best practice requires the use of a markup schema (such as an XML DTD or an XML Schema) to validate a given resource as conforming to its plan for the content and structure.
    b. A single markup schema that sanctions all common practices in structuring the content of a particular kind of resource will be too permissive to constrain any single resource to the specific plan of its creator. (See Simons 1998 for a discussion of this point with respect to the TEI DTD for print dictionaries.)
    c. There is enough convergence of practice that it will be possible to develop one or more specific markup schemas that can be recommended for widespread use while being adequately constraining.
    d. There will always be plans for content and structure that are unique enough to require that a unique markup schema be devised for the resource. (This is one of the conclusions of the 2001 EMELD workshop.)
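
The idea of validating a resource against its plan can be sketched in a few lines of Python. In practice one would use a validating parser with an XML DTD or XML Schema; the hand-rolled checker below is only an illustration of what such validation does, and its content models, element names, and the validate function are all assumptions made for this example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical content models, written as regular expressions over the
# space-terminated names of an element's children. In DTD terms they
# correspond roughly to:
#   <!ELEMENT text (sentence+)>
#   <!ELEMENT sentence (word+, translation?)>
#   <!ELEMENT word (form, gloss)>
CONTENT_MODELS = {
    "text": r"(sentence )+",
    "sentence": r"(word )+(translation )?",
    "word": r"form gloss ",
    "form": r"",
    "gloss": r"",
    "translation": r"",
}


def validate(elem, errors=None, path=""):
    """Collect a message for every element whose children violate
    the content model declared for it."""
    errors = [] if errors is None else errors
    here = f"{path}/{elem.tag}"
    model = CONTENT_MODELS.get(elem.tag)
    if model is None:
        errors.append(f"{here}: element is not declared")
        return errors
    children = "".join(child.tag + " " for child in elem)
    if not re.fullmatch(model, children):
        errors.append(
            f"{here}: children [{children.strip()}] do not match the content model"
        )
    for child in elem:
        validate(child, errors, here)
    return errors


VALID = "<text><sentence><word><form>a</form><gloss>b</gloss></word></sentence></text>"
# Same content, but gloss precedes form, which this plan does not allow.
INVALID = "<text><sentence><word><gloss>b</gloss><form>a</form></word></sentence></text>"

if __name__ == "__main__":
    print(validate(ET.fromstring(VALID)))
    print(validate(ET.fromstring(INVALID)))
```

A more permissive model for word, one that accepted form and gloss in either order and made both optional, would pass both documents. That is the sense in which a schema that sanctions every common practice stops constraining any single resource.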
These consequences of requirement 3 thus mean that there will be multiple markup schemas, even in the context of best practice. In order to achieve interoperability of resources when there are multiple markup schemas we will need to introduce a meta-level in our approach to markup:

  4. Linguists need to be able to query and otherwise manipulate multiple language resources in a single operation, even though they may individually have different markup vocabularies and schemas.
    a. As a foundation for interoperability, there must be a shared ontology (Langendoen and others 2002) for the kinds of information that are marked up in language resources.
    b. As the bridge to interoperability, each resource must have a metaschema that formally documents how the elements and attributes of its markup schema map onto the concepts of the common ontology.
    c. The metaschema must be separate from the language resource (rather than being an integral part of it) so that multiple resources can share the same metaschema.
    d. It must be possible for a third party to create a metaschema for a resource that lacks one without changing the resource itself. (This implies that the linkage from metaschema to schema to resource is specified through metadata.)
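
The role of the metaschema can be illustrated with a minimal Python sketch. Two resources use different markup vocabularies, and each is paired with a separate mapping from ontology concepts to the places in its markup where those concepts are expressed. The concept names (PartOfSpeech, WordForm), the mapping format, and both vocabularies are hypothetical; no actual ontology or metaschema format is implied.

```python
import xml.etree.ElementTree as ET

# Two resources with different, invented markup vocabularies.
RESOURCE_A = '<text><word pos="N">kuri</word><word pos="V">mata</word></text>'
RESOURCE_B = (
    "<corpus>"
    "<tok><w>na</w><cat>PRO</cat></tok>"
    "<tok><w>ruru</w><cat>V</cat></tok>"
    "</corpus>"
)

# Hypothetical metaschemas, kept separate from the resources themselves.
# Each maps an ontology concept to (element path, attribute name or None);
# None means the value is the text content of the element.
METASCHEMA_A = {"PartOfSpeech": (".//word", "pos"), "WordForm": (".//word", None)}
METASCHEMA_B = {"PartOfSpeech": (".//cat", None), "WordForm": (".//w", None)}

# The pairing of resource and metaschema stands in for the linkage that
# the text says is specified through metadata.
CATALOG = [(RESOURCE_A, METASCHEMA_A), (RESOURCE_B, METASCHEMA_B)]


def values_for(concept, xml_source, metaschema):
    """Extract every value of one ontology concept from one resource."""
    path, attr = metaschema[concept]
    root = ET.fromstring(xml_source)
    found = []
    for elem in root.findall(path):
        value = elem.get(attr) if attr else (elem.text or "").strip()
        if value:
            found.append(value)
    return found


def query_all(concept, catalog):
    """Answer one query over all resources, whatever their vocabulary."""
    results = set()
    for xml_source, metaschema in catalog:
        results.update(values_for(concept, xml_source, metaschema))
    return sorted(results)


if __name__ == "__main__":
    print(query_all("PartOfSpeech", CATALOG))
    print(query_all("WordForm", CATALOG))
```

Because the mappings live outside the resources, a third party could add an entry to the catalog for a resource that lacks a metaschema without touching the resource, and several resources that share a schema could point to the same mapping.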
Finally, it is not enough that electronically encoded resources are created. They must also be found and used by others long into the future. This implies a final set of consequences having to do with archiving.

  5. Linguists, educators, speakers of the language, and any other interested citizens of the world need to be able to find and use electronically encoded language resources long into the future.
    a. Electronically encoded language resources (with associated schemas and metaschemas) must be deposited into archives that can guarantee their long-term preservation and access.
    b. In order to make it possible for potential users of the resource to discover that the resource exists, a metadata description of it needs to be written and published in a searchable catalog of worldwide language resources.
    c. In order to help potential users of the resource judge the relevance of the resource, the metadata description needs to include information like identification of the specific language and a characterization of what type of resource it is from a linguistic point of view.
The Open Language Archives Community is already in place with an infrastructure that meets these needs, and EMELD will build on this infrastructure.
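
As a rough illustration of how such metadata supports discovery, the Python sketch below filters a small catalog by language and by linguistic resource type. The record fields, identifiers, language codes, and type labels are placeholders invented for this example; they are not the OLAC metadata format or its controlled vocabularies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """A minimal metadata description of one language resource."""
    identifier: str      # placeholder identifier, not a real archive ID
    title: str
    language: str        # placeholder language code
    resource_type: str   # placeholder label for the linguistic type


# Hypothetical catalog entries, invented for this sketch.
CATALOG = [
    Record("ex:0001", "Field notes, volume 1", "xxa", "primary_text"),
    Record("ex:0002", "Working dictionary", "xxa", "lexicon"),
    Record("ex:0003", "Sketch grammar", "xxb", "language_description"),
]


def find(catalog, language=None, resource_type=None):
    """Return the records matching every criterion that was supplied."""
    return [
        record
        for record in catalog
        if (language is None or record.language == language)
        and (resource_type is None or record.resource_type == resource_type)
    ]


if __name__ == "__main__":
    for record in find(CATALOG, language="xxa", resource_type="lexicon"):
        print(record.identifier, record.title)
```

The search is only as good as the consistency of the values: if two depositors identify the same language or the same kind of resource with different labels, neither query finds both, which is why controlled vocabularies for these fields matter.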


 
The shape of best practice  
Taken together, the above requirements and the consequent features of implementation suggest the following shape for best practice with respect to the markup of texts and lexicons:

|  | Best Practice for Resource Creation | What the Community Must Do to Support Best Practice |
|---|---|---|
| Language documentation and description | Archive resource as an XML document that is valid with respect to a descriptive markup schema that is supplied with the resource. | 1. Document characteristics of best practice descriptive markup. |
|  |  | 2. Recommend one or more markup schemas that meet these characteristics. |
|  |  | 3. Develop stylesheets that do presentational rendering of resources that conform to these schemas. |
| Metadescription for resource discovery | Provide OLAC metadata for the resource and deposit it with an OLAC data provider. | 4. Define the OLAC metadata standard. |
|  |  | 5. Define the controlled vocabulary for identifying language resource types in a refinement of the Dublin Core <type> element. |
|  |  | 6. Develop a community service for resource discovery. |
| Metadescription for resource interoperation | Provide a metaschema for the resource. | 7. Define a common ontology of the concepts of language description. |
|  |  | 8. Define the markup schema for a metaschema. |
|  |  | 9. Develop metaschemas for the schemas recommended in point 2 above. |
|  |  | 10. Develop a community service that uses metaschemas to provide interoperation across multiple language resources. |


When the 10 community action steps listed in the last column have been completed, the "formulation" part of the EMELD objectives listed at the outset of this paper will have been met. The "promulgation" part will require additional work in areas like documentation, dissemination, and training.

 
References  
Bird, Steven and Gary Simons, 2002. Seven Dimensions of Portability for Language Documentation and Description, Proceedings of the Workshop on Portability Issues in Human Language Technologies, Third International Conference on Language Resources and Evaluation, Las Palmas, Canary Islands. Available at: http://arxiv.org/abs/cs/0204020. Revised version: http://www.ldc.upenn.edu/sb/home/papers/0204020/0204020-revised.pdf

Langendoen, D. Terence and others, 2002. Publications of the EMELD Arizona group. Available at: http://emeld.douglass.arizona.edu:8080/group.html.

Simons, Gary F., 1998. Using architectural processing to derive small, problem-specific XML applications from large, widely-used SGML applications, SIL Electronic Working Papers 1998-006. Available at: http://www.sil.org/silewp/1998/006/.

Sperberg-McQueen, C. M. and Lou Burnard, 2001. A Gentle Introduction to XML. Chapter 2 of TEI P4: Guidelines for Electronic Text Encoding and Interchange, XML-compatible edition. TEI Consortium. Available at: http://www.tei-c.org/P4X/SG.html

W3C, 2000. Extensible Markup Language (XML) 1.0 (Second Edition), W3C Recommendation 6 October 2000. Available at: http://www.w3.org/TR/REC-xml.

W3C, 2002. The Semantic Web, an activity of the World Wide Web Consortium. Home page: http://www.w3.org/2001/sw/.

 


