"The process of documenting and describing the world's languages is undergoing radical transformation with the rapid uptake of new digital technologies . . . . While these technologies greatly enhance our ability to create digital data, their uncritical adoption has compromised our ability to preserve this data. The new digital language resources . . . . are difficult to reuse and less portable than the conventional printed resources they replace."

Why Best Practice?

Linguists have been quick to see the advantages of digitizing language data and making it available over the web. First, Internet technology can foster discussion and improve analytical accuracy by allowing others to see and hear the actual data on which an analysis or description is based. Second, digital copies in distributed archives are potentially our best means of preserving irreplaceable language data. But the multiplicity of digital formats and tools has made it very difficult for others to find, access, and re-use our work.

Best practices in the digitization of language data and documentation are designed to ensure that digital language resources are as independent of computer environments, scholarly communities, and domains of application as we can make them. Equally important is another goal: to make digital material endure through time. The implementation of best practice recommendations should ensure that your digital language documentation and description can be re-used by others, both now and in the future.

However, best practices in language digitization are a matter for the linguistic community to decide, both because community expertise is required to draft sound recommendations and because community support is necessary if the recommendations are to succeed in one of their major purposes: making our language documentation portable and re-usable. This purpose cannot be achieved without cooperation, since it requires that we adopt practices that make our digital resources maximally compatible with others' data and applications.

Best Practice in a Nutshell

  • Make an archive copy in plain text format.
  • Use XML markup for the archive copy.
  • Use Unicode for character encoding.
  • Create metadata for the resource in a standard format, e.g. OLAC.
  • Make the metadata available to a general search engine, e.g. the OLAC harvester.
  • Use open source software.
  • For language identification, use Ethnologue / OLAC language codes.
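
As a rough illustration of the first three "in a nutshell" recommendations (plain text, XML markup, Unicode), the following Python sketch writes a single lexical entry as UTF-8 encoded XML and reads it back. It is only a sketch under stated assumptions: the element names (entry, headword, gloss), the lang attribute, and the build_entry helper are hypothetical and do not come from any EMELD or OLAC schema, the headword is a made-up string, and "qaa" is a placeholder code that should be replaced by the real Ethnologue / OLAC code for the language.

```python
import xml.etree.ElementTree as ET


def build_entry(headword: str, gloss: str, language_code: str) -> bytes:
    """Serialize one lexical entry as UTF-8 encoded XML (hypothetical layout)."""
    # Element and attribute names are illustrative assumptions,
    # not part of any EMELD or OLAC schema.
    entry = ET.Element("entry", {"lang": language_code})
    ET.SubElement(entry, "headword").text = headword
    ET.SubElement(entry, "gloss").text = gloss
    # Returns bytes that begin with an XML declaration naming the encoding,
    # so the archive copy states how its characters are encoded.
    return ET.tostring(entry, encoding="utf-8", xml_declaration=True)


# "ŋaɲa" is an invented string used only to show non-ASCII characters;
# "qaa" is a placeholder language code, not a real assignment.
archive_bytes = build_entry("ŋaɲa", "example gloss", "qaa")

# Reading the bytes back checks that the Unicode text survives the round trip.
round_trip = ET.fromstring(archive_bytes)
```

Because the result is plain text with explicit markup and a declared encoding, any XML-aware tool should be able to read it without the software that created it, which is the kind of portability the recommendations aim for.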

The EMELD project, in collaboration with the Open Language Archives Community and other large language digitization projects like DOBES, has made a start on drafting such recommendations. Displayed on these pages are suggestions that have come out of the EMELD summer workshops. But we do not by any means consider them final. We present them here primarily to solicit your input.
