Biao Min Case Study:
From Notecards to the Web

The content of this page was developed from the data of Dr. David Solnit, in consultation with Dr. Martha Ratliff.

Introduction

For four months in 1982, while at the Central Nationalities Institute under a Graduate Program Fellowship, David Solnit collected all the existing field data on Biao Min. His collaborator and informant was Mr. Deng Fanggui (鄧方貴), a research fellow affiliated with the institute. Mr. Deng, a native speaker of Biao Min, was 52 years old at the time and came from the village of swəl:7 lyɔŋ2 (雙龍) in Quanzhou County (全州), Guangxi Province. David Solnit wrote down the Biao Min data on notecards, one word per card. Shown below are three examples of David Solnit's actual notecards.

[Images: Notecard 1, Notecard 2, Notecard 3]

More on digitizing images


These notecards were then put aside, in a closet, for over a decade. In 2001, Dr. David Solnit donated them to the E-MELD project. With the assistance of Dr. Martha Ratliff, the graduate research assistants of the Linguist List have been digitizing the Biao Min notecards.

Image digitization

Because this is the only documentation of Biao Min known to exist, simply entering the information on the notecards into a database is not sufficient. In digitizing documentation of endangered languages for long-term preservation, researchers must ask themselves, "Will future linguists find it valuable to have access to images of the notes themselves?" There are cases in which marks in field notes have led later analysts to a reinterpretation of the data. For this reason, it is important to create and archive digital images of the notecards.

The process of image digitization requires making decisions on scanning parameters and storage formats. Three different types of image formats may be created in the process of digitization: the archival format (master copy), the presentation or access format, and a thumbnail version of the image.

Archival format

The Biao Min notecards are being scanned with the following settings:

These decisions were based on the importance of clearly understanding the intellectual content of what was written on the cards, to serve the goal of preserving the lexicon of an endangered language. If we had been scanning other materials, such as artwork, or early versions of a rare and unusual orthography, or index cards annotated in differently colored inks, we might have chosen color and a higher resolution. Conversely, when scanning printed or typed text, a lower resolution and bitonal (1-bit) images may suffice. Scanning parameters need to vary according to the characteristics of the objects being scanned.

The resulting master images are being archived in TIFF format. TIFF is well suited to archival storage: it can hold image data either uncompressed or with lossless compression (such as Lempel-Ziv), so no image information is lost.

More on archival image formats

Presentation format

Uncompressed TIFF images are very large (approximately one megabyte per image). Large files take a long time to download, making them impractical for presentation on the web. Therefore, copies of the Biao Min notecard images are being compressed in GIF format for display purposes. GIF uses a lossless compression algorithm and a limited (8-bit) color palette to reduce file size. The two examples below show the image details for each format.

1) GIF image of a Biao Min notecard


Biao Min Notecard GIF Format
Image Details: Width: 1484 pixels; Height: 878 pixels; Bit Depth: 8 bits per pixel; Color Representation: Palettized; Compression: Lempel-Ziv; Size: 580 KB


2) TIFF image of a Biao Min notecard


Biao Min TIFF Notecard
Image Details: Width: 1480 pixels; Height: 876 pixels; Bit Depth: 8 bits per pixel; Color Representation: Palettized; Compression: Lempel-Ziv; Size: 614 KB
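
The "approximately one megabyte" figure can be checked against the image details listed above. The following is a minimal sketch in Python. The function names are illustrative, and it assumes that the listed sizes use 1 KB = 1024 bytes and that header and palette overhead can be ignored.

```python
def uncompressed_bytes(width_px, height_px, bits_per_pixel):
    """Size of the raw pixel data in bytes (headers and palette ignored)."""
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


def compression_ratio(raw_bytes, stored_bytes):
    """How many times smaller the stored file is than the raw pixel data."""
    return raw_bytes / stored_bytes


KB = 1024  # assumption: the sizes listed on this page use 1 KB = 1024 bytes

# Dimensions and bit depth taken from the image details above.
gif_raw = uncompressed_bytes(1484, 878, 8)
tiff_raw = uncompressed_bytes(1480, 876, 8)

print(gif_raw)   # 1302952
print(tiff_raw)  # 1296480

# Ratio of raw pixel data to the listed (Lempel-Ziv compressed) file sizes.
print(round(compression_ratio(gif_raw, 580 * KB), 2))
print(round(compression_ratio(tiff_raw, 614 * KB), 2))
```

Under these assumptions, each card holds about 1.3 million bytes of raw pixel data, and both listed files are roughly half that size, which is what lossless Lempel-Ziv compression would be expected to achieve on this kind of image.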

More on presentation image formats

Thumbnails

Thumbnails are frequently created for presentation. Thumbnails are usually GIF images that have been reduced in size, making it possible to display small, clickable versions of images on a single webpage. They are especially useful for accessing images that are difficult to describe in words, such as artwork or photographs. However, the most logical way to access the Biao Min notecards will be by linking them to the entries in the FIELD lexical database discussed below. Therefore, although we have created thumbnails for a few Biao Min notecards to be viewed on these pages, it is unlikely that we will produce thumbnails for each of the thousands of cards in the lexicon.

More on digitizing images

OCR or keyboard entry?

Creating digital images of the cards was the first step; the second was finding a way to preserve the textual information. There are two ways to digitize text: type it in, or run an optical character recognition (OCR) application to convert the images into characters. OCR works relatively well for printed or typed text, but unfortunately it is not yet suitable for handwritten notes. Therefore, the Biao Min lexicon had to be entered by hand into some sort of database.

More on OCR vs. keyboard entry

Text digitization

Since OCR is not suitable for handwritten documents, the research assistants of Linguist List began the time-consuming task of manually entering all of the data from David Solnit's notecards into a database using the FIELD tool. FIELD has been developed specifically for entry of lexical data in best practice format; it is Unicode-compliant, and has the ability to output the data as an XML document.

More on entering lexical data

Text storage

XML stands for eXtensible Markup Language. It defines a standard way of encoding the structure of information in plain text format. It is an open standard of the World Wide Web Consortium that is based on extensible tags (extensible meaning that they are not predefined, but can be defined by the creator). XML is currently considered best practice for the archival encoding of textual data, because it does not depend upon any particular software, and can be formatted through an XSL stylesheet to be displayed in almost any format. Furthermore, it is generally more self-descriptive than other electronic formats, which should make it more accessible to future generations.
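
To make the idea of creator-defined tags concrete, here is a minimal sketch in Python that builds one lexical entry as XML and reads it back. The element and attribute names (`lexicon`, `entry`, `form`, `gloss`, `image`), the file name, and the placeholder values are all hypothetical; they are not the schema that FIELD produces, and no real Biao Min data is shown.

```python
import xml.etree.ElementTree as ET


def build_entry(entry_id, form, gloss, image_file):
    """Build one <entry> element. All tag names here are illustrative."""
    entry = ET.Element("entry", {"id": entry_id})
    ET.SubElement(entry, "form").text = form
    ET.SubElement(entry, "gloss", {"lang": "en"}).text = gloss
    # Link back to the scanned notecard image (hypothetical file name).
    ET.SubElement(entry, "image", {"href": image_file})
    return entry


lexicon = ET.Element("lexicon", {"language": "Biao Min"})
lexicon.append(
    build_entry("e0001", "FORM-PLACEHOLDER", "GLOSS-PLACEHOLDER", "notecard0001.tif")
)

# Serialize to plain text; any XML-aware tool can read this back.
xml_text = ET.tostring(lexicon, encoding="unicode")
print(xml_text)

# Round trip: parse the text and recover the stored form.
parsed = ET.fromstring(xml_text)
print(parsed.find("entry/form").text)
```

Because the result is plain text with descriptive tags, the entry remains readable even without the program that created it, which is the property that makes XML attractive for archiving.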

More on XML

Text presentation

Stylesheets can be used to transform XML documents into different file formats (for instance, HTML, text, or PDF). Using an XSL processor, it is possible to transform an XML document via multiple XSL stylesheets which will display the information in multiple formats without changing the original XML document. Thus, a stylesheet could transform the same lexicon in XML into a learner's dictionary or an academic dictionary, in online or printed versions.
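
The principle of one source document with several presentations can be sketched as follows. Note that this is not XSL: Python's standard library does not include an XSLT processor, so two plain functions stand in for two stylesheets. The tag names and placeholder entries are hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical source lexicon; the entries are placeholders, not real data.
SOURCE_XML = """<lexicon language="Biao Min">
  <entry id="e0001"><form>FORM-1</form><gloss>GLOSS-1</gloss></entry>
  <entry id="e0002"><form>FORM-2</form><gloss>GLOSS-2</gloss></entry>
</lexicon>"""


def to_html(root):
    """Stand-in for an HTML stylesheet: render entries as a definition list."""
    items = [
        f"<dt>{escape(e.findtext('form'))}</dt><dd>{escape(e.findtext('gloss'))}</dd>"
        for e in root.iter("entry")
    ]
    return "<dl>" + "".join(items) + "</dl>"


def to_text(root):
    """Stand-in for a plain-text stylesheet: one tab-separated line per entry."""
    return "\n".join(
        f"{e.findtext('form')}\t{e.findtext('gloss')}" for e in root.iter("entry")
    )


root = ET.fromstring(SOURCE_XML)
html_view = to_html(root)  # for the web
text_view = to_text(root)  # for print or further processing

print(html_view)
print(text_view)
```

Both views are derived from the same parsed document, and neither function modifies it, so the archived XML stays untouched however many presentations are generated.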

More on stylesheets

Metadata creation

Metadata is information about resources. In this case, it is information about language resources: lexicons, audiotapes, transcribed texts, language descriptions, video recordings, etc. It is similar to card catalog information about library resources -- it enables discovery and retrieval of resources through standardized information.
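
As an illustration of how standardized fields support discovery, the sketch below describes the notecard collection with a few Dublin Core style element names (`title`, `creator`, `contributor`, `date`, `language`, `type`, `format`) and filters records by field. The names, date, and language come from this page; the title, the `type` and `format` values, and the `find` helper are assumptions made for the example, not the project's actual metadata record.

```python
# One catalog record; field names follow Dublin Core element names.
records = [
    {
        "title": "Biao Min lexicon notecards",  # hypothetical title
        "creator": "David Solnit",
        "contributor": "Deng Fanggui",
        "date": "1982",
        "language": "Biao Min",
        "type": "lexicon",       # assumed value
        "format": "image/tiff",  # assumed value for the archival scans
    },
]


def find(records, **criteria):
    """Return records whose fields match every criterion (case-insensitive)."""
    return [
        r
        for r in records
        if all(r.get(key, "").lower() == value.lower() for key, value in criteria.items())
    ]


# Discovery by standardized fields, much like searching a card catalog.
for match in find(records, language="biao min", type="lexicon"):
    print(match["title"], "-", match["creator"], "-", match["date"])
```

Because every record uses the same field names, a single query can search across collections from different contributors.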

More on metadata

 

Follow the digitization path of the Biao Min data:

  1. Get Started: Summary of Biao-Min Conversion
  2. Digitize Images: Digitizing Images page (Classroom)
  3. OCR or Keyboard Entry: OCR or Keyboard page (Classroom)
  4. Digitize Text: Lexical Analysis page (Workroom)
  5. Store Text: XML page (Classroom)
  6. Present Text: Stylesheets page (Classroom)
  7. Create Metadata: Metadata page (Classroom)
