Gene Deal, University of Utah

Computational Orthographic Transcription

Motivation Differing orthographies within some native languages deny native speakers of an otherwise mutually intelligible language the ability to share simple resources. A computational solution to this problem allows digital resources to be transcribed as needed for use by teachers or speakers of the language. This solution has the benefit of minimizing or avoiding political issues involved in standardizing orthographies and provides an alternative to more expensive solutions.
Problem and Approach An effective computational solution depends on two key components: the transcription algorithm and a transcription rule set.
The transcription algorithm extracts character clusters from each word, beginning with the largest possible cluster, for comparison to the transcription rule set. If the comparison results in a match, the corresponding transcription is added to an output string. If a match is not found, the cluster length is reduced by one and the comparison is repeated. When the cluster length has been reduced to a single character and a match is still not found, that character is added to the output string unchanged. The algorithm ensures that every possible cluster within each word is extracted and analyzed.
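The longest-match procedure described above can be sketched in Python as follows. This is a minimal illustration, not the author's implementation: the function name `transcribe_word`, the dictionary representation of the rules, and the test strings are assumptions made for this example. The only rules used are the two pairings quoted in this abstract (p>b and ihnu>ihyu). Capping the first cluster at the length of the longest rule is an added shortcut that does not change the output.

```python
def transcribe_word(word, rules):
    """Transcribe one word by longest-match lookup against a rule dict.

    `rules` maps a source cluster to its target cluster (hypothetical format).
    """
    # No cluster longer than the longest rule can match, so start there.
    max_len = max((len(src) for src in rules), default=1)
    out = []
    i = 0
    while i < len(word):
        # Try the largest cluster first, shrinking by one character each time.
        for size in range(min(max_len, len(word) - i), 0, -1):
            cluster = word[i:i + size]
            if cluster in rules:
                out.append(rules[cluster])
                i += size
                break
        else:
            # Even the single character has no rule: copy it unchanged.
            out.append(word[i])
            i += 1
    return "".join(out)


# The two example rules from the abstract; the inputs are made-up strings,
# not attested Shoshoni words.
rules = {"p": "b", "ihnu": "ihyu"}
print(transcribe_word("tihnup", rules))
```

Because the longest cluster is tried first, a multi-character rule such as ihnu>ihyu takes precedence over any shorter rule that matches the same position.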
The transcription rules are determined by comparing cognates from two orthographies and establishing correspondence sets. These correspondences result in character cluster pairings (rules) such as /p>b/ or /ihnu>ihyu/. Once established, rules are sorted by length to facilitate efficient comparison within the algorithm. The scope of the rule set determines the accuracy of the transcription.
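As a rough illustration of how such pairings might be stored and ordered, the sketch below parses rules written in the `/source>target/` notation used above and sorts them longest-first. The function name `parse_rules`, the tuple representation, and the choice of descending order are assumptions for this example; the abstract states only that rules are sorted by length.

```python
def parse_rules(lines):
    """Parse '/src>dst/' strings into (src, dst) pairs, longest source first."""
    rules = []
    for line in lines:
        # Drop surrounding whitespace and the enclosing slashes, then split on '>'.
        src, sep, dst = line.strip().strip("/").partition(">")
        if not sep or not src:
            raise ValueError(f"malformed rule: {line!r}")
        rules.append((src, dst))
    # Python's sort is stable, so rules of equal length keep their input order.
    rules.sort(key=lambda rule: len(rule[0]), reverse=True)
    return rules


print(parse_rules(["/p>b/", "/ihnu>ihyu/"]))
```

Ordering the rules longest-first means a matcher that scans the list from the top meets the most specific cluster before any shorter one.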
Results Test results using Shoshoni orthographies show a high level of accuracy and encourage further research with more complex orthographies. Ambiguities between orthographies are reflected in the transcription. In the Shoshoni tests, most of these ambiguities were resolved by expanding the scope of the distributions.
Conclusion Two key issues remain to be addressed. First, the rule-building procedure requires simplification to make the transcription process more accessible to non-linguists. Automating this process would greatly increase accessibility for potential users. Second, the beta version lacks an intuitive user interface, which would further increase accessibility. Computational transcription shows promise as a solution to the multiple-orthography problem. As a tool, it can greatly reduce the labor involved in transcription.