From Shoebox to the Web: Mocoví
- Present Text
- Store Text
- Encode Text
- Implement Multi-Media
- Follow the digitization path of the Mocoví data
Dr. Verónica Grondona.
Dr. Verónica Grondona collected the data on Mocoví during a combined period of six months between 1991 and 2003 in Colonia "El Pastoril," Villa Angela, Chaco, Argentina. She worked mainly with three native speakers of the language, all of whom spoke Spanish as a second language: Juan José Manito, born in 1943 and 'president' of the community between 1991 and 1993; Roberto Ruiz, born in 1939; and Valentín Salteño, born in 1966. Mr. Ruiz and Mr. Salteño are both 'bilingual helpers' at the local elementary school.
The data Professor Grondona collected makes up a database of about 3000 items, 300 sentences, and 8 interlinearized texts by various native speakers. The recordings each have a duration of about five minutes. Approximately 3,500 of these lexical items are represented in the Mocoví lexicon.
Linguists must consider the various uses of their data and prepare various formats accordingly. The most important format is the archival format, since this highly portable version will be the most enduring and therefore the most useful to future generations. However, linguists must also consider the short-term presentation of their data.
Stylesheets can be used to transform archival XML documents into different file formats (for instance, HTML, text, or PDF). Using an XSL processor, it is possible to transform an XML document via multiple XSL stylesheets which will display the information in multiple formats without changing the original XML document. Thus, a stylesheet could transform the same lexicon in XML into a learner's dictionary or an academic dictionary, in online or printed versions.
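The idea of one archival source feeding several presentation formats can be sketched in a few lines. The example below is only an illustration: it uses Python's standard library in place of an XSL processor, and the element names (`lexicon`, `entry`, `form`, `pos`, `gloss`) and sample entries are invented placeholders, not E-MELD's schema or real Mocoví data.

```python
import html
import xml.etree.ElementTree as ET

# Invented sample lexicon; element names and values are placeholders.
LEXICON_XML = """<lexicon>
  <entry><form>form-1</form><pos>n</pos><gloss>gloss one</gloss></entry>
  <entry><form>form-2</form><pos>v</pos><gloss>gloss two</gloss></entry>
</lexicon>"""


def to_html(root):
    """Render every entry as a row of an HTML table."""
    rows = []
    for entry in root.iter("entry"):
        cells = "".join(
            f"<td>{html.escape(entry.findtext(tag, default=''))}</td>"
            for tag in ("form", "pos", "gloss")
        )
        rows.append(f"<tr>{cells}</tr>")
    return "<table>\n" + "\n".join(rows) + "\n</table>"


def to_text(root):
    """Render every entry as one plain-text line."""
    return "\n".join(
        f"{e.findtext('form')} ({e.findtext('pos')}): {e.findtext('gloss')}"
        for e in root.iter("entry")
    )


root = ET.fromstring(LEXICON_XML)
html_view = to_html(root)   # online presentation
text_view = to_text(root)   # plain-text presentation
print(text_view)
```

Both views are derived from the same parsed document, and the source XML string is never modified, which is the property the stylesheet approach relies on.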
In order to create a presentation document comparing word forms in Mocoví to those of nearby (potentially related) languages, E-MELD used a stylesheet. To view the resulting document, access the Mocoví comparison word list.
The Mocoví data presented unique problems for storage and retrieval. Professor Grondona's original data was in a decade-old version of the Shoebox program, which is in a now obsolete DOS format. Shoebox is supported by SIL and is a program much favored by linguists because of its flexibility. However, Shoebox formats become obsolete quickly and are therefore not in accordance with the recommendations of best practices. Furthermore, it can be difficult to convert Shoebox data into recommended formats that have greater potential for long-term intelligibility.
XML (eXtensible Markup Language) is the preferred format for long-term intelligibility. XML is a standard way of encoding the structure of information in plain text format and is an open standard of the World Wide Web Consortium. XML is based on extensible tags; the tags are not pre-programmed, but can be defined by the creator. Because it does not depend upon any particular software and can be formatted through an XSL Stylesheet to be displayed in almost any format, XML is in accord with the recommendations of best practices for the archival encoding of textual data. Furthermore, it is generally more self-descriptive than other electronic formats, which should make it more accessible to future generations.
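As a rough illustration of creator-defined tags, the sketch below turns one Shoebox-style record into an XML entry. It assumes the record uses the common backslash field markers `\lx`, `\ps`, and `\ge`; the markers actually used in the Mocoví files, the XML element names, and the sample values are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from Shoebox field markers to XML element names.
MARKER_TO_TAG = {"lx": "form", "ps": "pos", "ge": "gloss"}


def record_to_xml(record):
    """Convert one backslash-coded record into an <entry> element."""
    entry = ET.Element("entry")
    for line in record.strip().splitlines():
        line = line.strip()
        if not line.startswith("\\"):
            continue  # skip anything that is not a field line
        marker, _, value = line[1:].partition(" ")
        tag = MARKER_TO_TAG.get(marker)
        if tag is not None:
            ET.SubElement(entry, tag).text = value.strip()
    return entry


# Placeholder record, not real Mocoví data.
SAMPLE_RECORD = r"""
\lx placeholder-form
\ps n
\ge placeholder gloss
"""

entry = record_to_xml(SAMPLE_RECORD)
xml_text = ET.tostring(entry, encoding="unicode")
print(xml_text)
```

The resulting entry is plain text whose tags name their own content, so it can be read without the program that produced it.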
Professor Grondona input her data into Shoebox using the IPA Kiel font, which encodes IPA symbols in the same ASCII code-points used for other characters. To ensure long-term intelligibility of the documentation, it was necessary to convert it to Unicode, which has one unique code-point for each character.
However, direct conversion was not possible. E-MELD wrote a program to change the IPA Kiel upper-ASCII characters to special characters, such as ##, then created a second program to convert those characters into the Unicode characters that were originally intended.
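The two-pass approach described above might look roughly like the following sketch. The byte values and placeholder tokens are invented for illustration; they are not the real IPA Kiel code-point assignments, and this is not the program E-MELD wrote.

```python
# Hypothetical legacy-font table: upper-ASCII byte -> placeholder token.
LEGACY_TO_TOKEN = {0xE1: "##GLOTTAL##", 0xE2: "##ESH##"}

# Second table: placeholder token -> intended Unicode character.
TOKEN_TO_UNICODE = {"##GLOTTAL##": "\u0294", "##ESH##": "\u0283"}


def legacy_to_tokens(raw):
    """Pass 1: replace each upper-ASCII byte with an unambiguous token."""
    parts = []
    for byte in raw:
        if byte < 0x80:
            parts.append(chr(byte))          # plain ASCII passes through
        elif byte in LEGACY_TO_TOKEN:
            parts.append(LEGACY_TO_TOKEN[byte])
        else:
            raise ValueError(f"unmapped legacy byte 0x{byte:02X}")
    return "".join(parts)


def tokens_to_unicode(text):
    """Pass 2: replace each token with its Unicode character."""
    for token, char in TOKEN_TO_UNICODE.items():
        text = text.replace(token, char)
    return text


raw = b"a\xe1a \xe2i"                        # invented sample bytes
intermediate = legacy_to_tokens(raw)
converted = tokens_to_unicode(intermediate)
print(intermediate)
print(converted)
```

Raising an error on unmapped bytes, rather than guessing, keeps a gap in the table from silently corrupting the data. The tokens must also be strings that never occur in the source text.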
To save time in the field, Grondona also used her own system of "shortcuts" to represent phonetic characters when typing in a non-IPA font. Researchers often use their own character replacements, either to save time or because the font they are using does not contain all the characters they need. For example, Professor Grondona used the numeral 7 to represent the IPA symbol for a glottal stop, ʔ. This practice caused problems in migrating the data into the FIELD database: to avoid mistakes with ambiguous characters (for example, did a "7" represent a glottal stop or a numeral?), each alternative symbol in the non-IPA font had to be replaced by hand.
Guidelines for these replacements were created to streamline the process. For the Mocoví data, vowels with accents (á,é,í,ó,ú) were in need of particular attention because they might indicate missing symbols. Character replacements were executed as follows:
|ɣ is replaced by x|7 is replaced by ʔ|
|lÿ is replaced by ʎ|æ is replaced by ə|
|sh is replaced by ʃ (for Mocoví)|ch is replaced by ʧ (for Mocoví)|
|j is replaced by ʝ (for Mocoví)|B is replaced by β (for Mocoví)|
|E is replaced by ɛ (for Mocoví)||
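A replacement table like this one can be applied mechanically for the unambiguous shortcuts, while ambiguous ones are flagged for a person to check. The sketch below covers a subset of the pairs listed above. The rule for "7" (treat it as a glottal stop only when it is not next to another digit, and flag the text otherwise) is an assumed heuristic for illustration; in the project itself these replacements were made by hand.

```python
import re

# Subset of the table above; multi-character shortcuts come first so that
# "sh" and "ch" are handled before any single-character rule.
REPLACEMENTS = [
    ("lÿ", "ʎ"),
    ("sh", "ʃ"),
    ("ch", "ʧ"),
    ("æ", "ə"),
    ("j", "ʝ"),
    ("B", "β"),
    ("E", "ɛ"),
]

# Assumed heuristic: a "7" with no digit on either side is a glottal stop.
LONE_SEVEN = re.compile(r"(?<!\d)7(?!\d)")
SEVEN_IN_NUMBER = re.compile(r"\d7|7\d")


def replace_shortcuts(text):
    """Return (converted_text, needs_review)."""
    for shortcut, ipa in REPLACEMENTS:
        text = text.replace(shortcut, ipa)
    needs_review = bool(SEVEN_IN_NUMBER.search(text))
    text = LONE_SEVEN.sub("ʔ", text)
    return text, needs_review


# Invented sample strings, not real Mocoví forms.
print(replace_shortcuts("sha7 chE"))
print(replace_shortcuts("17 items"))
```

A lone "7" that really is a numeral would still be converted by this rule, which is why a manual pass remains necessary.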
E-MELD research assistants are in the process of digitizing the audio tapes that Dr. Grondona collected in the field. After the tapes have been digitized using Sound Forge software, they will be aligned and annotated. Audio data is notoriously ephemeral because even digital formats deteriorate within a few decades. Common practice involves copying analog tapes onto fresh analog tapes to preserve the integrity of the data; this repeated copying, however, compromises the data's sound quality.
Best practices recommend that analog tapes be converted to digital formats, although these files will still require periodic migration to new physical media and new software. The data should be stored in an archival format, although alternative formats may also be desired for presentation purposes.
To ensure its long-term intelligibility, the camcorder footage that Dr. Grondona collected in the field is being converted to MPEG and annotated using open-source annotation software. Video data stored on magnetic tapes is vulnerable to decay, and even footage captured using digital cameras is initially stored on magnetic tapes. This data must be converted to a more enduring format.
Best practices recommend that video tapes be converted to uncompressed digital formats. This will create an archival copy of the data, although a compressed presentation copy might also be desirable. Video is an especially complex multimedia format, requiring the synchronization of a series of still images with a soundtrack. Care must be taken in the conversion of this type of data to ensure accuracy.